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4.1 Introduction 



This chapter discusses theoretical framework and methods for developing knowledge- 
based potential functions essential for protein structure prediction, protein- 
protein interaction, and protein sequence design. We discuss in some details 
about the Miyazawa-Jernigan contact statistical potential, distance-dependent 
statistical potentials, as well as geometric statistical potentials. We also de- 
scribe a geometric model for developing both linear and non-linear potential 
functions by optimization. Applications of knowledge-based potential functions 
in protein-decoy discrimination, in protein-protein interactions, and in protein 
design are then described. Several issues of knowledge-based potential functions 
are finally discussed. 

In the experimental work that led to the recognition of the 1972 Nobel 
prize in chemistry. Christian Anfinsen showed that a completely unfolded pro- 
tein ribonuclease could refold spontaneously to its biologically active conforma- 
tion. This observation indicates that the sequence of amino acids of a protein 
conta ins all of the informa t ion needed to s pecify its three-dimensional struc- 
ture l)Anfinsen et all Il96lt lAnfinsenL Il973|) . The automatic in vitro refold- 
ing of denatured proteins was further confirmed in many other protein sys- 
tems ((janicke. .1987,) . Anfinsen's experiments led to the thermodynamic hy- 
pothesis of protein folding, which postulates that a native protein folds into 
a three-dimensional structure in equilibrium, in which the state of the whole 
protein-solvent system corresponds to the global minimum of free energy under 
physiological conditions. 

Based on this thermodynamic hypothesis, computational studies of proteins, 
including structure prediction, folding simulation, and protein design, all de- 
pend on the use of a potential function for calculating the effective energy of 
the molecule. In protein structure prediction, the potential function is used 
either to guide the conformational search process, or to select a structure from 
a set of possible sampled candidate struc tures. Pote ntial function has been 
developed through an inductive approach ^Sinnt ll99,'^ . where the parameters 
are derived by matching the results from quantum-mechanical calculations on 
small molecules to experimentally measured thermodynamic properties of sim- 
ple molecular systems. These potential functions are then generalized to the 
macromolecular level based on the assumption that the complex phenomena 
of macromolecular systems result from the combination of a large number of 
interactions as found in the most basic molecular systems. This type of poten- 
tial function is often referred to as "physics-base d" , "ser ni-empirical" effective 
potential function, or a fo r ce field llLevitt and W arshel. Il975t IWolvnes et all 
ll995HMomanv et a/.l. [l975t iKarplus and Petskairi990|) . The physics-based po- 
tential functions have been extensively studied ov er the last three decades, 
and has found wide uses in protein folding studies ( Duan and Kollmanl Il998l: 
iLazaridi s and KarphisL l200nl) . Nevertheless, it is difficult to use physics-based 
potential functions for protein structure prediction, because they are based on 
full atomic model and therefore require high computational cost. In addition, a 
physical model may not fully capture all of the important physical interactions. 
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Readers are referred to Chapter 4 for a detailed discussion of physics-based 
potential functions. 

Another type of potential function is developed through a deductive ap- 
proach by extracting the parameters of the potential functions from a database 
of known protein structures ijSipplL Il993|) . Because this approach implicitly 
incorporates many physical interactions (electrostatic, van der Walls, cation- tt 
interactions) and the extracted potentials do not necessarily reflect true en- 
ergies, it is often referred to as "knowledge-based effective energy function". 
In recent past, this approach has quickly gained momentum due to the rapidly 
growing database of experimentally determined three-dimensional protein struc- 
tures. Impressive successes in protein folding, protein-protein docking and pro- 
tein design have been achieved recently using knowledge-based scoring func- 

tions i)Russ and RanganathanLl2002HVenclovas a^J . l2003HHu et 'aIl . l2004tlM'endez et all 
I2OO5I) . In this chapter, we focus our discussion on this type of potential func- 
tions. 



4.2 General framework 

Several different approaches have been proposed to extract knowledge-based 
scoring functions from protein structures. They can be categorized roughly into 
two groups. One prominent group of knowledge-based potent ials are those de- 



rived from statistical analysis of database of protein structures llTanaka and Scheragal 
1976aHMivazawa and JerniganllijSStlSamudrala and MouHIIQQSULu and SkolnicM 



200 In this class of potentials, the interacting potential between a pair 



pared with that in a reference state or a null model ( Mivazawa and Jerniean 


1996 


Samudrala and Mould. I1998HLu and Skolnickll2001; Wodak and Rooman 


1993 


Sippl Il995l iLemer et al\. Il995l IJerniEan and Bahai 


, 1996; 


Simons et al. 


1999? 


). A different class of knowledge-based potentials are based on optimiza- 



tion. In this case, the set of parameters for the potential functions are optimized 
by some criterion, e.g., by maximizing the energy gap betwe en known native 



19921 iMaiorov and Crippenl Il992l; IXhomas and Dil 


. Il996al: iTobi et all hOOOsl 


Vendruscolo and Domanvlll 99811 Vendruscolo et all 


2000aHBastolla et a/.l.l200ll 


Dima et ad l2000HMicheletti et ad l200lUDobbs et adl2002HHu et adl2004l). 


There are three main ingredients for developing 


a knowledge-based poten- 



tial function. We first need protein descriptors to describe the sequence and the 
shape of the native protein structure in a format that is suitable for computa- 
tion. We then need to decide on a functional form of the potential function. 
Finally, we need a method to derive the values of the parameters for the potential 
function. 
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4.2.1 Protein representation and descriptors. 

To describe the geometric shape of a protein and its sequence of amino acid 
residues, a protein is frequently represented by a d-dimensional descriptor c € 
R''. For example, a method that is widely used is to count non-bonded contacts 
of 210 types of amino acid residue pairs in a protein structure. In this case, 
the count vector c £ R'^,d — 210, is used as the protein descriptor. Once 
the structural conformation of a protein s and its amino acid sequence a is 
given, the protein descriptions / : {s, a) i-^ R'' will fully determine the d- 
dimensional vector c. In the case of contact descriptor, / corresponds to the 
mapping provided by specific contact definition, e.g., two residues are in contact 
if their distance is below a cut-off threshold distance. At the residue level, the 
coordinates of of Cq, Cp, or side-chain center can be used to represent the 
location of a residue. At atom level, the coordinates of atoms are directly used, 
and contact may be defined by the spatial proximity of atoms. In addition, 
other features of protein structures can be used as protein descriptors as well, 
including distances between residue or atom pairs, solvent accessible surface 
areas, dihedral angles of backbones and side-chains, and packing densities. 

4.2.2 Functional form. 



The form of the potential function H : M.'^ ^ M. determines the mapping of a 
d-dimensional descriptor c to a real energy value. A widely used functional form 
for protein scoring function H is the weighted linear sum of pairwise contacts 



jTanaka and Scheraealll976aHMivazawa and JerniEanl 


1985t 


Tobi et aZ.ll2000a 


Vendruscolo and Domanvlll998HSamudrala and Moull 


,|l99J 


Lu and Skolnick 



200ll) . The linear sum H is: 



i7(/(s,a)) =i7(c) = i«.c = ^u;,c„ (4.1) 

i 

where "■" denotes inner product of vectors; Ci is the number of occurrence of the 
i-th type of descriptor. As soon as the weight vector w is specified, the potential 
function is fully defined. In subsection 14.4.31 we will discuss a nonlinear form 
potential function. 

4.2.3 Deriving parameters of potential functions. 

For statistical knowledge-based potential functions, the weight vector w for lin- 
ear potential is derived by characterization of the frequency distributions of 
structural descriptors from a database of experimentally determined protein 
structures. For optimized knowledge-based linear potential function, w is ob- 
tained through optimization. We describe the details of these two approaches 
below. 
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4.3 Statistical method 



4.3.1 B ackgr ound 

In statistical methods, the observed statistical frequencies of various protein 
structural features are converted into effective free energies, based on the as- 
sumpt ion that freq uently obs erved structural features corresp ond to low- energy 
states l|Tanaka and Schcragal ligySbtlMivazawa and Jerniga^ ll985: SipDil. ll99(y . 
This is the Boltzmann's principle, an idea first proposed by Tanaka and Sc heraga 
(1976) to estimate potentials for pairwise interaction between amino acids l|Tanaka and Scheragal 
Il976b|) . Miyazawa and Jernigan (1985) significantly extended this idea and 
derived a widely-used statistical potentials, where solvent terms are explicitly 
considered and the interactions b etween amino acids are modeled by contact po- 
tentials. Sippl (1990) and ot hers ijSamudrala and Mould . Ii993|Lu and SkolnickL 
l20niUZhou and ZhoUEool derived distance-dependent energy functions to in- 
corporate both short-range and long-range pairwise interactio ns. The pairwise 

terms were further augme nted by incorporating dihedral angles llNishikaw a and Matsuol 
1993HKocher et adll994 . solvent accessibility and hydrogen-bonding ijNishikawgT^Lnd Matsuol 



1993() . Singh and Tropsha (1996) derived potentials for higher-order interac- 
tions ijSingh et "^. Il996albl) . More recently, Ben-Naim (1997) presented three 
theor etical exarnples to demonstrate the nonadditivity of three-body interac- 
tions ljBen-Naimlll997a|l . Li and Liang (2005) identified three-body interactions 
in native proteins based on an accurate geometric mode l, and quantified syste m- 
atically the nonadditivities of three-body interactions l)Li and Liangl 

4.3.2 Theoretical model. 

At the equilibrium state, an individual molecule may adopt many different con- 
formations or microscopic states with different probabilities. The distribution 
of protein molecules among the microscopic states follows the Boltzmann dis- 
tribution, which connects the potential function H(c) for a microstate c to its 
probability of occupancy 7r(c) . This probability 7r(c) or the Boltzmann factor 
is: 

7r(c) = exp[~H{c)/kT]/Z{a), (4.2) 

where k and T are Boltzmann constant and the absolute temperature measured 
in Kelvin, respectively. The partition function Z{a) is defined as: 

Z(a) = ^cxp[-i/(c)/fcr]. (4.3) 
c 

It is a constant under the true energy function once the sequence a of a protein 
is specified, and is independent of the representation /(s, a) and descriptor 
c of the protein. If we are able to measure the probability distribution 7r(c) 
accurately, we can obtain the knowledge-based potential function H(c) from 
the Boltzmann distribution: 

H{c) = -kTlnn{c) - kT In Z{a). (4.4) 
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The partition function Z{a) cannot be obtained directly from experimental 
measurements. However, at a fixed temperature, Z{a) is a constant and has no 
effect on the different probability of occupancy for different conformations. 

In order to obtain an knowledge-based potential function that encodes the 
sequence-structure relationship of proteins, we have to remove background en- 
ergetic interactions H'{c) that are independent of the protein sequence and the 
protein structure. These gene ric energetic contributions are referred collectively 
as that of the reference state l)SiDDAll990() . An effective potential energy AH{c) 
is then obtained as: 

AH{c) = Hie) H\c) = -kTH^] - fcrin[|^], (4.5) 

where 7r'(c) is the probability of a sequence adopting a conformation specified 
by the vector c in the reference state. Since Z{a) and Z'{a) are both constants, 
—kT\n{Z{a)/Z'{a)) is also a constant that does no t depend on the descriptor 
vector c. If we assume that Z{a) « Z'{a) as in the effective 

potential energy can be calculated as: 

AHic) = -fcrin[^] (4.6) 
7r'(c) 

To calculate 7r(c)/7r'(c), One can further assume that the probability distri- 
bution of each descriptor is independent, and we have 7r(c)/7r'(c) = Iljp^rj]- 
Furthermore, by assuming each occurrence of the i-th descriptor is independent, 
we have JJ^ [-^jj^] = Yli Yl^ [f^l ' where tt^ and tt- are the probabihty of i-th type 
structural feature in native proteins and the reference state, respectively. In a 
linear potential function, the right-hand side of Equation 14.61 can be calculated 
as: 

-kT\n[^]^^kTY^cM^]■ (4.7) 
7r'(c) ^ 

Correspondingly, to calculate the effective potential energy AH{c) of the 
system, one often assumes that AH{c) can be decomposed into various basic 
energetic terms. For a linear potential function, AH{c) can be calculated as: 

AH{c) = J2^H{c,)=Y,c^w,. (4.8) 

i i 

If the distribution of each Ci is assumed to be linearly independent to the others 
in the native protein structures, we have: 

= -A:Tln[^]. (4.9) 

In another word, the probability of each structural feature in native protein 
structures follows the Boltzmann distribution. This is the Boltzmann assump- 
tion made in nearly all statistical potential functions. Finkelstein (1995) sum- 
marized protein structural features which are observed to correlate with the 
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Boltzmann distribution. These include the distribution of residues between the 
surface and interior of globules, the occurrence of various 4>, ■0, y angles , cis and 
trans prolines, ion pairs, and empty cavities in protein globules ( Finkclst ein et al\ 
119951) . 

The probability tt^ can be estimated by counting frequency of the i-th struc- 
tural feature after combining all structures in the database. Clearly, the prob- 
ability TTi is determined once a database of crystal structures is given. The 
probability tt- is calculated as the probability of the z-th structural feature in 
the reference state. Therefore, the choice of the reference state has large effects 
and is critical for developing knowledge-based statistical potential function. 

4.3.3 Miyazawa-Jernigan contact potential 

Because of the importance of the Miyazawa-Jernigan model in developing sta- 
tistical knowledge-based potential and its wide use, we discuss the Miyazawa- 
Jernigan contact potential in details. This also gives an exposure of different 
technical aspects of developing statistical knowledge-based potential functions. 

Residue representation and contact definition. In the Miyazawa-Jernigan 
model, the ^-th residue is represented as single ball located at its side-chain 
center zi. If the Z-th residue is a Gly residue, which lacks a side chain, the 
positions of the C" atom is taken as 2/. A pair of residues (Z,m) are defined 
to be in contact if the distance between their side-chain centers is less than a 
threshold 9 = 6.5 A. Neighboring residues I and m along amino acid sequences 
(I? — m| = 1) are excluded from statistical counting because they are likely to be 
in spatial contact that does not reflect the intrinsic preference for inter-residue 
interactions. Thus, a contact between the Z-th and m-th residues is defined 
using A(/^„): 

. _ / 1' '^i \zi - Zm\ < anA \l - ■m\ > I; 
^(/,m) - ]^ 0, otherwise, 

where \zi — Zm\ is the Euclidean distance between the Z-th and m-th residues. 
Hence, the total number count of contacts of residue type i with residue 
type j in protein p is: 

= E^C.™)' if = or (j,z), (4.10) 

l.m, 
l<rn 

where 1(1) is the residue type of the l-th amino acid residue. The total number 
count of contacts in all proteins are then: 

^(^.j) ^^^(^.j):P' «>J = 1:2,--- ,20. (4.11) 
p 

Coordination and solvent assumption. The number of different types of 
pairwise residue-residue contacts ?Z(i j) can be counted directly from the struc- 
ture of proteins following Equation 14. Ill We also need to count the number of 
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residue-solvent contacts. Since solvent molecules are not consistently present in 
X-ray crystal structures, and therefore cannot be counted exactly, Miyazawa 
and Jernigan made an assumption based on the model of an effective sol- 
vent molecule, which has the volume of the average volume of the 20 types 
of residues. Physically, one effective solvent molecule may represent several 
real water molecules or other solvent molecules. The number of residue-solvent 
contacts q) can be estimated as: 

20 

n{i,o) = li^i - ri(jj) + 2n(is)), (4.12) 

where the subscript represents the effective solvent molecule; the other indice 
i and j represent the types of amino acids; n(i) is the number of residue type i in 
the set of proteins; qi is the mean coordination number of buried residue i, cal- 
culated as the number of contacts formed by a buried residue of type i averaged 
over a structure database. Here the assumption is that residues make the same 
number of contacts on average, with either effective solvent molecules (first term 
in Equation (|4.12ll . or other residues (second term in Equation 14.12|l '). 

For convenience, we calculate the total numbers of residues of residue- 
residue contacts n(^r,r), of residue-solvent contacts n(r,o), and of pairwise contacts 
of any type as follows: 

20 20 20 

i—1 j — ^ i—1 

20 

?^(r,0) = ''^(0,r) = ^ ?i(i,0); "■(■,■) = n(^r,r) + "(r.O) + '^-(O^O)- 

i=l 



Chemical reaction model. Miyazawa and Jernigan (1985) developed a phys- 
ical model based on hypothetical chemical reactions. In this model, residues of 
type i and j in solution need to be desolvated before they can form a contact. 
The overall reaction is the formation of (i, j) contacts, depicted in Figure 4.1a. 
The total free energy change to form one pair of (i, j) contact from fully solvated 
residues of i and j is (Figure 4.1a): 

= i^{iJ) + -^(0,0)) - (-E'(i,o) + -^(i,o))j (4-13) 

where j) is the absolute contact energy between the i-th and j-th types of 
residues, and j) = i) ; £'(i,o) are the absolute contact energy between the 
i-th residue and effective solvent, and -E(i.o) = -^(o.i)! likewise for £'(j.o); -^'(o.o) 
are the absolute contact energies of solvent-solvent contacts (0, 0). 

The overall reaction can be decomposed into two steps (Figure 4.1b). In the 
first step, residues of type i and type j, initially fully solvated, are desolvated 
or "demixed from solvent" to form self-pairs and The free energy 
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(a) 



(1) desolvation 



M) + (M 



'(jj) - -^%o) 



m + (ixj 



CTQ 



to 

B 

>< 

+ 



2e 



LAX 
(b) 



(i.D 



Figure 4.1: The Miyazawa-Jernigan model of chemical reaction. Amino acid residues 

first go through the desolvation process, and then mix together to form pair contact 

interactions. The associated free energies of desolvation e(i i) and mixing e'^i can be 
obtained from the equilibrium constants of these two processes. 



changes e(i^i) and e(jj) upon this desolvation step can be easily seen from the 
desolvation process (horizontal box) in Figure 4.1 as: 



E, 



(0,0) 



2E, 



(*,o); 



-E(o,o) - 2£^(j,o), 



(4.14) 



where E(^i i^,E(^jj^ are the absolute contact energies of self pair (i,i) and (:/, j), 
respectively. In the second step, the contacts in and pairs are broken 
and residues of type i and residues of type j are mixed together to form two 
(z, j) pairs. The free energy change upon this mixing step 2e'^j^-j is (vertical box 
in Figure 4.1): 

H,.)=2%,)-(i?(M)+i?(„-)). (4.15) 

Denote the free energy changes upon the mixing of residue of type i and solvent 
as e'( ■ , We have: 



and — 2e', 



(4.16) 



which can be obtained from Equation 14 . 1 41 and Eg nation 14 . 1 51 aft er substituting 
"j" with "0". Following the reaction model of Figure 4.1b, the total free energy 
change to form one pair of («,j) can be written as: 



2e 



2e', 



(h3) 



2e', 



(ifi) 



2e', 



(4.17a) 
(4.17b) 
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Contact energy model. The total energy of the system is due to the contacts 
between residue-residue, residue-solvent, solvent-solvent: 



20 20 
i=0 j=0; 

4.18) 

20 20 20 ^ ' 

= X] X] + X! -^(^o)'^(i,o) + -E'(o,o)"(o,o) 



i=l j=l; 



Because the absolute contact energies j) is difficult to measure and knowledge 
of this value is unnecessary for studying the dependence of energy on protein 
conformation, we can simplify Equation 14.181 further. Our goal is to separate 
out terms that do not depend on contact interactions and hence do not depend 
on the conformation of the molecule. Equation 14. 181 can be re- written as: 



20 20 



Ec = ^(2£'(,_o) -^^(o,o))9*"w/2- + X!X!^(*.j)"(^j) (4-19a) 

1=0 i=l 

j>i 

20 20 20 

= Xl^(».0*"w/2 + Xl XI ^kj)"(^i) (4.19b) 

i=a i=a j=0; 

3>i 

by using Equation 14.121 and Equation 14.131 Here only the second terms in 
Equation 14. 19al and I4.19bl are dependent on protein conformations. Therefore, 
only either ei^ij) or e'^^ ^-^ needs to be estimated. Since the number of residue- 
residue contacts can be counted directly while the number of residue-solvent 
contacts is more difficult to obtain. Equation 14. 19al is more convenient for cal- 
culating the total contact energy of protein conformations. Both P-(i^j) and 
e'l- ^-^ are termed as effect ive contact energies and their values were reported in 
ijMivazawa and JerniganLfl996(l . 

Estimating effective contact energies: quasi-chemical approximation. 

The effective contact energies e^^ j) in Equation 14. 19al can be estimated in kT 
unit by assuming that the solvent and solute molecules are in quasi-chemical 
equilibrium for the reaction depicted in Figure 4.1a: 

[m(,,j-)/m(.,.)][m(o,o)/m(.,.)] m^^)n^^ 

= - -In 7 -, 77 -, r = - In (4.2U) 

['7i(j^O)/m(-,-)Ji'^0\0)/"^(-,-)J "^(j,0)"^a,0) 

where ni/^ij^, 'm-{i,o)j ^-^d m(o,o) ^-re the contact numbers of pairs between residue 
type i and j, residue type i and solvent, and solvent and solvent, respectively. 
m(.^.) is the total number of contacts in the system and is canceled out. Similarly, 
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e'^j^-j and e'^^ q-j can be estimated from the model depicted in Figure 4.1b: 

2e'(,,) = - In J^^^MIL; (4.21a) 

2e',o) = -InJl^^MlL, (4.21b) 
^ ' ' "1(0,0) 

Based on these models, two different techniques have been developed to 
obtain effective contact energy parameters. Following the hypothetical reaction 
in Figure 4.1(a), e(ij) c an be directly est imated from Equation 14.201 as was 
done by Zhang and Kim l)Zhang and KimL |^00) . Alternatively, one can follow 
the hypothetical two-step reaction in Figure 4.1b and estimate each term in 
Equation 14. 1 7bl for ei^ij) by using Eauation l4.21l Because the second approach 
leads to additional insight about the desolvation effects (e'^^ q-j ) and the mixing 
effects {^'(i j)) ill contact interactions, we follow this approach in subsequent 
discussions. The first approach will become self-evident after our discussion. 

Models of reference state. In reality, the true fraction of contacts of 

^(■^■) 

type among all pairwisc contacts (•, •) is unknown. One can approximate 
this by calculating its mean value from sampled structures in the database. We 
have: 

_ Sp"(»j);p ^ ^ Y.p'niifiy^p ^ TO(o,o) _ '^(o,o);p 

where i and j ^ 0. However, this yields a biased estimation of e'^,- and e{ij)- 
When effective solvent molecules, residues of i-th type and residues of j-th type 
are randomly mixed, e'^^ will not equal to as should be because of differ- 
ences in amino acid composition among proteins in the database. Therefore, a 
reference state must be used to remove this bias. 

In the work of Miyazawa and Jernigan, the effective contact energies for 
mixing two types of residues e',- and for solvating a residue e',- p,-, are estimated 
based on two different random mixture reference states IjMivazawa and Jerniganl 
[i985t). In both cases, the contacting pairs in a structure are randomly permuted, 
but the global conformation is retained. Hence, the total number of residue- 
residue, residue-solvent, solvent-solvent contacts remain unchanged. 

The first random mixture reference state for desolvation contains the same 
set of residues of the protein p and a set of effective solvent molecules. We 
denote the overall number of (i, «),(«, 0), (0, 0) contacts in this random mixture 
state after summing over all proteins as c'^ ^ , c'^ ^ , and c'^q g-j , respectively, c'^^ ^-j 
can be computed as: 

C(vO-E[^l'^]'-(-.);P' (4-22) 

p 2^ 9fe "-fc;p 
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where Miyazawa and Jernigan assumed that the average coordination number 
of residue i in all proteins is qi. Therefore, a residue of type i makes qiUi-p 
number of contacts in protein p. Similarly, the number of (i,0) contacts c'^j q) 
can be computed as: 



-(4,0) 



E 



Qi rii: 



nt. 



(■,0);p- 



(4.23) 



From the horizontal box in Figure 4.1, the effective contact energy e'^^ 
now be computed as: 



26(^,0) = - In 



„2 r' 2 



(z ^ 0). 



(4.24) 



The second random mixture reference state for mixing contains the exact 
same set of residues as the protein p, but have all residues randomly mixed. 

The 



We denote the number of (i,j) contacts in this random mixture as C(i j).p. 
overall number of contacts in the full protein set C(^ij) is the sum of c, 
over all proteins: 



E 



^ '^(i,-);P j^ '^(j:-);P j 



■);p- 



(4.25) 



From the vertical box in Figure 4.1, the effective contact energy e'^^^-j 
be computed as: 



I 



^2 



i or j 0. 



can now 



(4.26) 



The compositional bias is removed by the denominator in Equation 14.261 and 
e'^j i) equals to 0. 

Although can be estimated from Equation (|4.21b(l by assuming that 

^[i 0) = in a reference state, Zhang and DeLisi (1997) simplified the Miyazawa- 
Jernigan process by further assuming that the number of solven t-solvent con- 
tacts in both reference states is the same as in the native state ijZhang et all 

C(o,o) = "(0,0)- (4.27) 

Therefore, c'^g g^ and ?^(o,o) ^-re canceled out in Eauation l4.24l and not needed for 
calculating e'^^ g^ . This treatment systematically subtracts a constant scaling 
energy from all effective energies and should produce exactly the same 

relative energy values for protein conformations as Miyazawa- Jernigan's orig- 
inal work, with the difference of a constant offset value. In fact, Miyazawa 
and Jernigan (1996) showed that this constant scaling energy is the effective 
contact energy Cff between the average residue r of the 20 residue types, and 
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Table 4.1: Contact energi es in kT units: eu for upper half and diagonal and 



for lower half (from i Mivazawa and JerniganL I199(tI) '1 



Cys Met Phe He Leu Val Trp Tyr Ala Gly Thr Ser Asn Gin Asp Glu His Arg Lys Pro 



CyS; 
Met 
Phe 
He 
Leu 
Val 
Trp 
Tyr 
Ala 
Gly 
Thr 
Ser 
Asn 
Gin 
Asp 
Glu 
His 
Arg 
Lys 
Pro 



5.44 - 
0.46: 
0.54- 
0.49- 
0.57 
0.52 
0.30. 
0.64. 
0.51 
0.68 
0.67 
0.69 
0.97 
0.64 
0.91 
0.91 
0.65 
0.93 
0.83 
0.53 



4.99 
5.46 
0.20. 
0.01 
0.01 
0.18 
0.29 
0.10 
0.15 
0.46 
0.28 
0.53 
0.62 
0.20 
0.77 
0.30 
0.28 
0.38 
0.31 
0.16 



5.80- 
6.56- 
7.26 - 

0.06; 

0.03- 
0.10- 
0.00 
0.05 
0.17 
0.62 
0.41 
0.44 
0.72 
0.30 
0.75 
0.52 
0.39 
0.42 
0.33 
0.25 



5.50. 
6.02. 
6.84. 
6.54 - 
0.08: 
0.01. 
0.02 
0.11 
0.05 
0.62 
0.30 
0.59 
0.87 
0.37 
0.71 
0.46 
0.66 
0.41 
0.32 
0.39 



5.83- 
6-41- 
7.28- 
7.04 
7.37 
0.04: 
0.08 
0.10 
0.13 
0.65 
0.40 
0.60 
0.79 
0.42 
0.89 
0.55 
0.67 
0.43 
0.37 
0.35 



-4.96- 

5- 32- 

6- 29- 
6-05- 
6-48- 
5-52 - 

0.11; 

0.23- 
0.08 
0.51 
0.36 
0.55 
0.77 
0.46 
0.89 
0.55 
0.70 
0.47- 
0.33- 
0.31 - 



4.95. 
5.55. 
6.16. 
5.78- 
6.14- 
5.18- 
5.06 . 

0.04; 

0.07 
0.24 
0.37 
0.38 
0.30 
0.19. 
0.30. 
0.00- 
0.08 
0.11. 
0.10. 
0.33. 



4.16- 

4- 91- 

5- 66- 
5-25 
5-67 
4-62- 
4.66- 
4.17 - 
0.09: 
0.20 
0.13 
0.14 
0.17 
0.12 
0.07 
0.25 
0.09 
0.30 
0.46 
0.23 



3.57- 
3-94- 
4.81- 
4.58- 
4.91- 
4.04- 
3.82- 
3.36- 
2.72 - 
0.18. 
0.10 
0.18 
0.36 
0.24 
0.26 
0.30 
0.47 
0.30 
0.11 
0.20 



3.16-3.11 
3.39-3.51- 
4.13-4.28- 
3.78-4.03 
4.16-4.34 
3.38-3.46- 
3.42-3.22. 
3.01-3.01- 
2.31-2.32- 
2.24 -2.08- 
0.10 -2.12 - 
0.14-0.06: 
0.22 0.02 
0.24-0.08 
0.13-0.14- 
0.36-0.22 
0.50 0.16 
0.18-0.07- 
0.03-0.19 
0.13 0.04 



2.86- 
3.03 
4.02- 
3-52- 
3-92- 
3.05- 
2.99- 
2.78- 
2.01- 
1.82- 
1.96 
1.67 - 

0.10; 

0.11- 
0.19 
0.19- 
0.26 
0.01- 
0.15- 
0.14 



2.59- 
2-95- 
3.75- 
3.24- 
3.74 
2.83- 
3.07- 
2.76- 
1-84 
1-74 
l.SS- 
1.58- 
1.68 
0.10. 
0.24. 
0.21 
0.29 
0.02 
0.30 
0.18 



2.85- 
3.30- 
4-10- 
3.67- 
4.04- 
3.07- 
3.11- 
2.97- 
1.89- 
1.66- 
1.90- 
1.49- 
1.71- 
1-54 - 

0.09; 

0.19 
0.31- 
0.26- 
0.46- 
0.08 



2.41- 

2- 57- 

3- 48- 
3-17- 
3.40- 
2.48- 
2.84- 
2.76- 
1.70- 
1.59- 
1.80- 
1-63- 
1-68- 
1.46- 
1-21 - 

0.05; 

0.19- 

0- 91- 

1- 01- 

0-14 



2.27-3.60- 

2- 89-3-98- 

3- 56-4-77- 
3-27-4-14 
3-59-4-54 
2-67-3.58- 
2.99-3-98 
2.79-3.52 
1.51-2.41. 
1.22-2.15- 
1.74-2.42 
1.48-2.11 
1.51 -2.08- 
1.42-1.98- 
1.02-2.32. 
0.91 -2.15- 
0-16 -3-05 - 

1.04 0-14; 

1.28 0-23 
0-07 0-15 



2- 57- 

3- 12- 
3-98- 

3- 63- 

4- 03- 
3-07- 
3-41- 
3.16 
1.83. 
1.72. 
1.90- 
1-62- 
1-64- 

1- 80- 

2- 29- 
2-27- 
2-16- 
1-55 - 

0-24; 

0-05- 



1- 95-3-07 

2- 48-3-45 

3- 36-4-25 
3-01-3-76 
3-37-4-20 
2.49-3.32 
2.69-3.73 
2.60-3.19 
1.31-2.03 
1.15-1.87 
1.31-1.90 
1.05-1.57 
1.21-1.53 
1.29-1.73 
1.68-1.33 
1.80-1.26 
1.35-2.25 
0.59-1.70 
0.12 -0.97 
0.04 -1.75 



sugg ested that ea n) — e.ff being used to measure the stability of a protein struc- 



ture IjMivazawa and .Terniga: 



m g used 



Hydrophobic nature of Miyazawa-Jernigan contact potential. In the 

relation of Eauation l4.17bl e(,j j) = e'^^^ — (e'^j o) o))' Miyazawa-Jernigan 
effective contact energy e{i^j) is composed of two types of terms: the desolvation 
terms e'^,- q-j and e'^^- and the mixing term e'^^ ^-^ . The desolvation term of residue 
type i, that is, —s-'^^q-^ or e(i i)/2 (Figure 4.1), is the energy change due to the 
desolvation of residue i, the formation of the i-i self-pair, and the solvent-solvent 
pair. The value of t his term eg i\l2 should correlate well with the hydrophobic- 
ity of residue type i I Mivazawa and JerniganlllQSSULi et a?J . ll997j) . although for 



charged amino acids this term also incorporates unfavorable electrostatic poten- 
tials of self-pairing. The mixing term e'^^ ^-j is the energy change accompanying 
the mixing of two different types of amino acids of i and j to form a contact 
pair %-j after breaking self-pairs i-i and j-j. Its value measures the tendency 
of different residues to mix together. For example, the mixing between two 
residues with opposite charges are more favorable than mixing between other 
types of residues, because of the favorable electrostatic interactions. 

Important insights into the nature of residue-residue contact interactions 
can also be obtained by a quantitative analysis of the desolvation terms and 
the mixing terms. Among different types of contacts, the average difference 
of the desolvation ter ms is 9 times larger than that of the mixing terms (see 
Table ETTl taken from ijMiv azawa and Jerniganlll996|) ). Thus, a comparison of 
the values of + and e'^. ^-^ clearly shows that the desolvation term 

plays the dominant role in determining the energy difference among different 
conformations. 
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Similar conclusion can be drawn by an eigenvalue decomposition analysis 
of the Miyazawa-Jernigan matrix M, which is mad e up of eg va lues alone, 
without the knowledge of the mixing terms e',. IjLi et all Il997|) . The M 
matrix is a 20 x 20 real symmetric matrix, and thus can be reconstructed based 
on the following spectral decomposition: 



N N 

^ihj) = E ^kVkVkjij = E XkVk{i)vkU): (4-28) 

k=l k=l 

where Afc and is the fc-th largest eigenvalue and the corresponding eigenvector, 
respectively; Vk{i) is the i-th component of the fc-th eigenvector. Li et al. (1997) 
found that there are two dominant eigenvalues Ai and A2, and the corresponding 
two eigenvectors are strongly correlated after a shift and a rescaling operation, 
i.e., V2 = au + Pvi. Here, u is the 1 vector with each component equals to 1 
and a and /3 are scalars. Therefore, M can be well-approximated with only one 
eigenvector Vi corresponding to the largest eigenvalue Ai. For each entry e(i^j) 
of the matrix M ^ we have the following approximation: 

e(ij) « \ivi{i)vi{i) + \2V2{i)v2{j) « cq + ci{qi + Qj) + C2qiqj, (4.29) 

where qi = vi{i), and cq, ci and C2 are constants. To better understand the un- 
derlying physical implications, Eauation l4.29l can be rewritten into the following 
form: 

e{ij) ~ hi + hj - C2{qi - 9^)^/2, (4.30) 

where 

hi = Co/2 + ciqi + (c2/2)(7f . 

He re hi + h j is a s ingle-body term and arc interpreted as the desolvation term 
in I Li et a/.L Il997l) : —C2{qi — qjY is a two-body term and are interpreted as 



the mixing term and the magnitude of the mixing term is significantly smaller 
than that of hi + hj. This result is not surprising and is consistent with the 
original model of Miyazawa-Jernigan contact matrix M , where &(i^j) = e'^^ ^■■j — 

K^.o) +^(j,o))-. 

To summarize, the quantitative analysis of Miyazawa-Jernigan contact en- 
ergies reveals that hydrophobic effect is the dominant driving force for protein 
folding. To a large extent, this conclusion justifies the HP model proposed 
by Chan and Dill (1990) where only hydro phobic interactions a re included in 
studies of simple models of protein folding l)Chan and Dillll990|) . 



4.3.4 Distance dependent potential function 

In the Miyazawa-Jernigan potential function, interactions between amino acids 
are assumed to be short-ranged and a distance cutoff is used to define the 
occurrence of a contact. This type of statistical potential is referred to as the 
"contact potential". Another type of statistical potential allows modeling of 
residue interactions that are distance-dependent. The distance of interactions 
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are usually divided into a number of small intervals or bins, and the potential 
functions are derived by applying Eauation l4.9l for individual distance intervals. 



Formulation of distance-dependent potential functions. In distance- 
dependent statistical potential functions, Eg nation 14 . 91 can be written in several 
forms. To follow the conventional notations, we use (i, j) to represent the fc-th 
protein descriptor Ck for pairwise interactions between residue type i and residue 
type j. From Eauation l4.9l we have: 

ill , 

where (i, j; d) represents an interaction between a specific residue pair (z, j) at 
distance d, AH(i, j; d) is the the contribution from the j) type of residue pairs 
at distance d, 7r(i,j; d) and7r'(j,j; d) are the observed and expected probabilities 
of this distance-dependent interaction, respectively, the observed number 

of d) interactions, n the observed total number of all pairwise interactions 
in a database, n'^^j^j.^^ the expected number of d) interactions when the 
total number of all pairwise interactions in reference state is set to be n. 

Since the expected joint probability 7r'(i,j; d) for the reference is not easy 
to estimate, Sippl (1990) replaces Equation 14. 91 with: 



■k'{i,j I d) n'{i,j I d) 

ill , 



(4.31b) 



where TT{i,j \ d) and 7r'(i,j | d) are the observed and expected probability 
of interaction of residue pairs (i, j) given the distance interval d, respectively; 
n(d) is the observed total number of all pairwise interactions at the distance 
d; n'f^ij.j_^ = '^'{hj I d) ■ n{d) is the expected number of interactions at 
d when the total number of all pairwise interactions at this distance d in the 
reference state is set to n(^j_y There are several variations of potential function 
of this form, includin g the "Knowledge-Base d Potential function" (KBP) by Lu 
and Skolnick (2001) iILu and SkolnickL l200ll) . 

In the work of developing th e " Residue- syecific All-atom Probability Dis- 
criminatory Function" (RAPDF) ijSamudrala and Mould . [r998|) . Samudrala and 
Moult (1998) alternatively replaced Eauation l4.9l with: 



Ai7(z,,; d) = -In^yM = _ ; Mm) 



In 



TT'{d I i,j) TT'{d I i,j) 

"(» j; d) 
^kj;d) 



(4.31c) 
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where 7r(d | and Tr'{d \ are the observed and expected probabihty 
of interaction at the distance d for a given pair of residues respectively; 
n(j_j) is the observed total number of interactions for (i, j) pairs regardless of the 
distance, '^'(j j. ^) = 7r'(d | i, j) -71,(4 is the expected number of interactions 
at distance d when the total number of interactions in the reference state 
is set to n(d)- 

The knowledge-based potential functions of Equation I4.31al I4.31bl and 

can all be written using the unifying formula based on the number counts 
of interactions: 

AH{^,J;d) = -H^^]. (4.32) 

Clearly, the different ways of assigning n'^^j.j^^ make the potential functions 
differ from each other substantially, since the method to calculate ni^^ j.^) is 
essentially the same for many potential functions. In other words, the model of 
reference state used to compute n'^ • ^. is critical for distance-dependent energy 
functions. 



Different models of reference states. Sippl (1990) first proposed the "uni- 
form density" model of reference state, where the probability density function 
for a pair of contacting residues («, j) is uniforml y distributed along the distance 
vector connecting them: 7r'(i,j | d) = TT'{i,j) l|Sippl ITggOl) . Lu and Skolnick 
made use of this type of referenc e state to calculate the expected number of 
interactions at distance d as ijLu and SkolnickLl200l[l : 



n 



Hd)- 



The expected probability tt' is estimated using the random mixture approx- 
imation as: 

T^'ihi) = XiXj, 

where Xi ^-iid Xj ^-re the mole fractions of residue type i and j, respectively. 

Samudrala and Moult (1998) made use of another type of reference state, 
where the probability of the distance b etween a pair of residues (i, 7) being d is 
independent of the contact types l|Samudrala and Moultlll998|) : 

^'(d|j,j)=7r'(d). 

The expected number of interactions at distance d in Equation 14. 31cl be- 
comes: 

where 7r'(r) is estimated from 7r(r): 

7r'(d) = 7r(d) = 



Ideal gas reference state. In the uniform density model of Sippl, the same 
density of a particular residue pair along a line could result from very 
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different volume distribution of pairs in specific regions of the protein. 

For example, one spherical shell proximal to the molecular center could be 
sparsely populated with residues, and another distant shell could be densely 
populated, but all may have the same density of pairs along the same 

radial vector. Zhou and Zhou (2002) developed a new reference state (called 
Dfire for "Distance-scaled, Finite Ideal-gas REferenc e state") where residue s 
follow uniform distribution everywhere in the protein (jZhou and Iholiooi). 
Assuming that residues can be modeled as noninteracting points {i.e., as ideal 
gas molecules), the distribution of interacting pairs should follow the uniform 
distribution not only along any vector lines, but also in the whole volume of the 
protein. 

When the distance between a pair of residues is at a threshold distance 
dg = 14.5 A, the interaction energy between them can be considered to be 
0. Therefore, residue type i and type j form pairs at the distance dg purely 
by random, and the observed number of pairs at the distance dg can be 
considered the same as the expected number of (i, j) pairs at the distance dg in 
the reference state. Denote Vd as the volume of a spherical shell of width Ad at 
a distance d from the center. The expected number of interactions (i, j) at the 
distance d after volume correction is: 

, Vd d Ad 

For a protein molecule, n'^^ ^-^ will not increase as because of its finite 
size. In addition, it is well-known that the volume of protein molecule cannot be 
treated as a solid body, as there are numerous voids and pockets in the interior. 
This implie s that the number d ensity for a very large molecule will also not 
scale as dP l|Liang and Dil]Lf200l() . Zhou and Zhou (2002) assumed that n'^^^.^-^ 
increase in rather than d^, where the exponent a needs to be determined. To 
estimate the a value, each protein p in the database is reshaped into a ball of 
radius CpRg-, p, where Rg. p is the radius of gyration of the protein p, and residues 
are distributed uniformly in this reshaped ball. Here Cp takes the value so that 
in the reshaped molecule, the number of total interacting pairs at dg distance 
is about the same as that observed in the native protein p, namely: 

{id) 

for protein p. Once the value of Cp is determined and hence the effective radius 
CpRg-p for each native protein is known, the number of interacting pairs n(^d) 
at distance d can be counted directly from the reshaped ball. Zhou and Zhou 
further defined a reduced distance-dependent function f{d) = n(^d)/d'^ ^-nd the 
relative fluctuation 6 of f{d): 

d 

where / = /('^)/"'" ^^^^ total number of distance shells, all of 

which has the same thickness, a is then estimated by minimizing the relative 
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fluctuation S. The rationale is that since idealized residues are points and are 
uniformly distribu ted in the reshaped ba ll. 6 should be 0. In their study, a was 
found to be 1.61 l)Zhou and Zhoui l2002() . 



4.3.5 Geometric potential functions. 

The effectiveness of potential function also depends on the representation of 
protein structures. Another class of knowledge-based statistical potentials is 
based on the computation of various geometric constructs that reflect the shape 
of the protein molecule s more accur ately. These geometric constructs in clude the 



1996aHZhene et aZJ.Il997aHCarter Jr. et al. 


l200ll;lKrishnamoorthv and Trooshal 


20031). and the aloha shaoe l|Li et a/.U200J 


; iLi and Liand, l2005b'a'l of the pro- 



tein molecules. Geometric potential functions has achieved significant successes 
in many fields. For example, the potential function developed by McConkey 
et al. is based on the Voronoi diagram of the atomic structures of proteins, 
and is among one o f the best performing; a, tom-level potential functions in de- 
coy discrimination ((McConkev et al\ l200,1) . Because the alpha shape of the 
molecule contains rich topological, combinatorial, and metric information, and 
has a strong theoretical foundation, we discuss the alpha potential functions in 
more detail below as an example of this class of potential function. 

Geometric model. In Miyazawa-Jernigan and other contact potential func- 
tions, pairwise contact interactions are declared if two residues or atoms are 
within a specific cut-off distance. Contacts by distance cut-off can potentially 
include many impla usible non-contact i ng ne ighbors, which have no significant 
physical interaction ijBienkowska et a/Ill999|) . Whether or not a pair of residues 
can make physical contact depends not only on the distance between their center 
positions (such as Cq or C/5, or geometr ic centers of side chain) . but also on the 
size and the orientations of side-chains ijBienkowska et allll999f l. Furthermore, 
two atoms close to each other may in fact be shielded from contact by other 
atoms. By occupying the intervening space, other residues can block a pair of 
residues from direct interacting with each other. Inclusion of these fictitious 
contact interactions would be undesirable. 

The alpha potential solves this problem by identifying interacting residue 
pairs following the edges computed in the alpha shape. Details of alpha shape 
can be found in the Chapter Protein structure geometry" . When the parameter 
a is set to be 0, residue contact occurs if residues or atoms from non-bonded 
residues share a Voronoi edge, and this edge is at least partially contained in 
the body of the molecule. Figure 4.2 illustrates the basic ideas. 

Distance and packing dependent alpha potential. For two non-bonded 
residue balls bi of radius with its center located at Zi and bj of radius rj at Zj , 
they form an alpha contact \ a) if their Voronoi regions intersect and these 
residue balls also intersect after their radii are inflated to ri{a) — {rf + a)^/^ 
and rj{a) = (r| -|-a)^/^, respectively. That is, the alpha contact | a) exists 
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(a) a = 



(b) a = 4 



(c) a = 00 



Figure 4.2: Schematic drawing of the Delaunay complex and the alpha shape of a two- 
dimensional molecule. The Voronoi region of a ball is the set of points closest to it when 
measured in power distance. If two Voronoi regions share a boundary, i.e. if there is a 
Voronoi edge (dashed line), we draw a Delaunay edge (solid line in grey or black) between 
these two Voronoi vertices. A Delaunay edge is therefore the dual of a Voronoi edge. 
All Delaunay edges incident to ball residue bi form the 1-star for bi, denoted as St\{bi). 
When the balls are inflated by increasing the a value, more balls overlap, and more Voronoi 
edges intersect with the balls. Therefore, more dual Delaunay edges are included in the 
alpha shape (shown as black solid line segments), (a) When a = 0.0, the balls are not 
inflated and there is only one alpha edge a.^ 3 between ball 62 and ball 63; (b) When 
Q = 4.0, the balls are inflated and their radii are + 4.0. There are six alpha edges: 
''"0 1 1 ''"0 2 5 °"o 3 5 '^o 4 1 °"o 4 5 '^o 5 1 3"<^ ""6 7- For a ball bi, the set of residue balls connected 
to it by alpha edges are called the near neighbors of the ball. The number of this set of 
residue balls is defined as the degree of near neighbors of the residue ball bi, denoted as pi. 
For example, pg — 5, and — 1; (c) When a — 00, all the Delaunay edges become alpha 
edges [a = 16.0 is used for drawing). Hence, all long-range interactions not intervened by 
a third residue are included. 



when: 

\z, ~Zj\< (r,2 + a)i/2 + (r/ + a)^/^ cr,j e K.^ and \i - j\ > 1. 

We further define the 1-star for each residue baU bi as: Sti{bi) = {{bi, bj) G 
K-a , namely, the set of l-simphces with bi as a vertex. The near neighbors of bi 
are derived from Sti{bi) and are defined as: 

^^a{b^) = {bj\(T,j e ICa}, a = 4.0. 

and the degree of near neighbors pi of residue bi is defined as the size of this set 
of residues: 

p, = \M„{b,)\, a = 4.0. 

The degree of near neighbors pi is a parameter related to the local packing 
density and hence indirectly the solvent accessibility around the residue ball bi 
(Figure 4.2b). A large pi value indicates high local packing density and less 
solvent accessibility, and a small pi value indicates low local packing density 
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Figure 4.3: Non-interacting pairs. (61,64) is considered as a non-interacting pair because 
the shortest length i{i.4) is equal to three, i.e., the interaction between 61 and 64 is blocked 
by two residues 67 and b^. Likewise, (63,66) is considered as a non-interacting pair as well. 

and high solvent accessibihty. Similarly, the degree of near neighbors for a pair 
of residues is defined as: 

P(,j) = \Na{h,bj)\ = \Ma{hi)\ + Wa{bj)\, a = 4.0. 

Reference state and collection of non-interacting pairs. We denote the 
shortest path length between residue bi and residue bj as Lf^i j^, which is the 
fewest number of alpha edges (a = 4) that connects bi and bj. The reference 
state of the alpha potential is based on the collection of all non-interacting 
residue pairs (i, j): 

= 3}; 

Any (i, j) pair in this reference state is intercepted by two residues (Figure 4.3). 
We assume that there is no attractive or repulsive interactions between them, 
because of the shielding effect by the two intervening residues. Namely, residue i 
and residue j form a pair only by random chance, and any properties associated 
with bi, such as packing density, side-chain orientation, are independent of the 
same properties associated with bj. 

Statistical model: pairwise potential and desolvation potential. Ac- 
cording to Equation 14.91 the packing and distance-dependent statistical po- 
tential of residue pair (fc, I) at the packing environment P{k,i) and the distance 
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specified by a is: 



H{k,l,p^k,i) I a) - -A-TM ^^y-"'"'"'"^ ). (4.33) 

(fc,', P(fc,i)) 

Here, 'K(k,i,p^k i) I «) ^he observed probability: 

"(fc,i,P(fc,,),a) , , 

7r(fc,i,p(^,,) I a) = , (4.34) 

where riQ-^i^p^^ is the number of residue pair {k,l) at the packing environ- 
ment P(k,i) and the distance specified by a, and n^a) is the total number of 
residue pairs at the distance specified by a. 7r|^, ^ is the expected proba- 

bility: 

'^{k,i, P(k,i)) 

where n'^^ ^ is the number of residue pair (fc, I) at the packing environment 



Z{k,i) in reference state, and n' is the total number of non-interacting residue 
pairs at the reference state. 

The desolvation potential of residue type k to have p near neighbors H{z \ k) 
is estimated simply by following Equation 1)4.91 : 

I /c) ['^(r,p)/"(r)J 

where r represent all 20 residue types. 

For a protein structure, the total internal energy is estimated by the summa- 
tion of the desolvation energy and pairwise interaction energy in the particular 
desolvated environment: 



H{s,a) = ^H{p I k) ■ n(^k,p) 



(4.37) 



k,p 

+ \ H{k,l,p^kA)\a)-n^kj,p^^^,^,a) 

k,l,pk,i ,a 

4.3.6 Sampling weight of proteins in database 

When developing statistical energy functions using a database consisting of 
many homologous sequences, undesirable sampling biases will be introduced. 
An easy way to avoid such sampling bias is to construct a database of struc- 
tures in which no pair of proteins can have more than 25% sequence identity. 
By this criterion, a structure database may exclude a significant number of 
informative structures, which may be valuable for studying a specific type of 
proteins with very few known structures. An alternative method to avoid such 
sampling bias without neglecting these structures is to introduce weights that 
are properly adjusted for each structure, which may or may not be homologous 
to other structures in the database. 
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A similarity matrix S of all pr oteins in the database can be us ed to decide the 
weight for each protein structure l)Mivazawa and JermganLll996|) . The similarity 
between the A:-th and l-th. proteins is defined by Miyazawa and Jernigan based 
on the result of sequence alignment: 



Ski 



kl 



Lk + Li 
< Ski ^ sik < 1 

Skk — 1 

where 9ki is the number of identical residues in the alignment, Lk and L/ are the 
lengths of sequences k and I, respectively. This similarity matrix S is symmetric 
and composed of real values. It has the spectral decomposition: 



S = ^X,Vivf, 



(4.38) 



where and Vi are the i-th eigenvalue and eigenvector of S, respectively. For 
symmetric matrix, these eigenvectors form an orthonormal base. Because for the 
symmetric matrix S, — Trace(S') = fipj-ot ^^'^ positive semi-definite. 



we have: 



< Aj < nprot, 



(4.39) 



where nprot is the number of proteins included in the database. The value of Xi 
reflects the weight of the corresponding orthogonal eigenvector Vi to the matrix 
S. For the special case where there one distinct sequence, which is completely 
dissimilar to any other Up^ot — 1 sequences in the database, at least one eigen- 
values will be exactly equal to 1 and the corresponding eigenvector represents 
this distinct sequence but contains no information about other sequences due to 
the orthogonality of the eigenvectors of matrix S. In another case when there 
is one set of m sequences which are exactly the same within the group but are 
completely dissimilar to any other riprot — m sequences outside this set, at least 
one eigenvalue will be exactly equal to m and m — 1 eigenvalues will be equal 
to zero. The eigenvector corresponding to the non-zero eigenvalue represents 
the whole group of those m sequences but contains no information about other 
sequences. 

On the basis of these characteristics, Miyazawa and Jernigan (1996) de- 
creased all eigenvalues > 1 to 1 to reconstruct a new weight matrix S' , so that 
redundant information from similar sequences are removed and the weight Wk 
for the kth protein in the database is determined. In another word, we have 
before weighting: 



after weighting, 



Wk = Skk 



Wk = Sfefc 



1 



kk 



(4.40) 



(4.41) 



kk 
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where 

/ A„ ifA, <1; 
' \ 1, if A, > 1. 

Therefore, if and only if a sequence is completely dissimilar to any other se- 
quences (A'; = Xi — 1), the sampling weight for that sequence will be 1. If 
all Uprot sequences in the database are identical, the sampling weights for these 
sequences will be l/ripjot- Generally, sampling weights take a value between one 
and l/jT-prot' about negatively proportional to the number of similar 

sequences. 



4.4 Optimization method 

There are several drawbacks of knowledge-based potential function derived from 
statistical analysis of database. These include the neglect of chain connectivity 
in the reference state, and the problematic implicit assumption of Boltzmann 
distribution (Thomas and Dill, 1996b a; Bcn-Naim, 1997b). We defer a detailed 
discussion to Section 11.7. II 

An alternative method to develop potential functions for proteins is by op- 
timization. For example, in protein design, we can use the thermodynamic hy- 
pothesis of Anfinsen to require that the native amino acid sequence a n mounted 
on the native structure sn has the best (lowest) fitness score compared to a set 
of alternative sequences (sequence decoys) taken from unrelated proteins known 
to fold into a different fold T> = {s7v,£Id} when mounted on the same native 
protein structure s^- 

H{f{sN, a^v)) < H{f{sN,aD)) for all {sn, an) e T>. 

Equivalently, the native sequence will have the highest probability to fit into the 

specified native structure. This is the same p rinciple described in ( Shakhno vich and Gutinl . 
ll993HDeutsch and Kurosk^llQQetlLi et a/.l. ll996i. Sometimes we can further re- 
quire that the score difference must be greater than a constant 6 > ijShakhnovichl 
Il994) : 

H{f{sN, aw)) + b< H{f{sN,aD)) for aU {sn, an) e T>. 

Similarly, for protein structure prediction and protein folding, we require 
that the native amino acid sequence ajv mounted on the native structure 
has the lowest energy compared to a set of alternative conformations (decoys) 
V = {si3,ajv}: 

H{f{sN, aw)) < H{f{sD,aN)) for aU sd G V. 

and 

H{f{sN, ajv)) + 6 < H{f{sD,as)) for aU (s^, aw) e V. 

when we insist to maintain an energy gap between the native structure and 
decoy conformations. For linear potential function, we have: 

w ■ cjy + b < w ■ CD for all cd = /(s/j, a^) (4.42) 
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Our goal is to find a set of parameters through optimization for the potential 
function such that all these inequalities are satisfied. 

As discussed earlier, there are three key steps in developing effective knowledge- 
based scoring function using optimization: (1) the functional form, (2) the 
generation of a large set of decoys for discrimination, and (3) the optimiza- 
tion techniques. The initial step of choosing an appropriate functional form is 
important. Knowledge-Based pairwise potential functions are usually all in 
the form of weighted linear sum of interacting residue pairs. In this func- 
tional form, the weight coefficients are the parameters of the potential func- 
tion, which are optimized for discrimination. This is the same functional form 
used in statistical potential, where the weight coefficients are derived from 
database statistics. The objectives of optimization are often maximization 
of energy gap between native protein and the average of decoys, or energy 
gap bet ween n ative and decoys with lowest score, or the z- s core of the native 
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4.4.1 Geometric nature of discrimination 

There is a natural geometric view of the inequality requirement for weighted 
linear sum scoring functions. A useful observation is that each of the inequalities 
divides the space of R'^ into two halves separated by a hyperplane (Figure 4.4a). 
The hyperplane for Eauation l4.42l is defined by the normal vector (cn — cd) and 
its distance &/||cjv — cd|| from the origin. The weight vector w must be located 
in the half-space opposite to the direction of the normal vector (cjv — cu). This 
half-space can be written as w ■ (cjv — cd) -|- 6 < 0. When there are many 
inequalities to be satisfied simultaneously, the intersection of the half-spaces 
forms a convex polyhedron tedelsb runneil Il987f) . If the weight vector is located 
in the polyhedron, all the inequalities are satisfied. Scoring functions with such 
weight vector w can discriminate the native protein sequence from the set of 
all decoys. This is illustrated in Figure 4.4a for a two-dimensional toy example, 
where each straight line represents an inequality w ■ {cn ~ cd) -I- & < that the 
scoring function must satisfy. 

For each native protein i, there is one convex polyhedron Vi formed by the set 
of inequalities associated with its decoys. If a scoring function can discriminate 
simultaneously n native proteins from a union of sets of sequence decoys, the 
weight vector w must be located in a smaller convex polyhedron V that is the 
intersection of the n convex polyhedra: 

n 
i=l 

There is yet another geometric view of the same inequality requirements. If 
we now regard (cjv — cd) as a point in M'*, the relationship w ■ {cn — cd) + b < 
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for all sequence decoys and native proteins requires that all points {cn ~ Cd} 
are located on one side of a different hyperplane, which is defined by its normal 
vector w and its distance to the origin (Figure 4.4b). We can show that 

such a hyperplane exists if the origin is not contained within the convex hull of 
the set of points {cjv — cd} (see Appendix). 

The second geometric view looks very different from the first view. However, 
the second view is dual and mathematically equivalent to the first geometric 
view. In the first view, a point cn — cd determined by the structure-decoy pair 
Cjv = (•sjv, o,n) and cd ~ (■sjv, lu) corresponds to a hyperplane representing an 
inequality, a solution weight vector w corresponds to a point located in the final 
convex polyhedron. In the second view, each structure-decoy pair is represented 
as a point cm — cn in W^, and the solution weight vector w is represented by a 
hyperplane separating all the points C — {cjv — c/)} from the origin. 



4.4.2 Optimal linear potential function 

Several optimization methods have been applied to find the weight vector w 
of linear scoring function. The Rosenblantt perceptron method works by iter- 
atively updating an ini tial weight vector ioq l)Vendruscolo and DomanviL Il998l: 
iMicheletti etdl l200lh . Starting with a random vector, e.g., Wq = 0, one 
tests each native protein and its decoy structure. Whenever the relationship 
w ■ (cjv — Cjj) + b < is violated, one updates w by adding to it a scaled violat- 
ing vector 77 • (cjv — Cjy). The final weight vector is therefore a linear combination 
of protein and decoy count vectors: 

to = ^ ?7(ca, - Cd) ^ ^ "jvCat - ^ anc^,. (4.43) 

Here M is the set of native proteins, and V is the set of decoys. The set of 
coefficients {a^} U {q!_d} gives a dual form representation of the weight vector 
w, which is an expansion of the training examples including both native and 
decoy structures. 

According to the first geometric view, if the final convex polyhedron V is 
non-empty, there can be infinite number of choices of w, all with perfect discrim- 
ination. But how do we find a weight vector w that is optimal? This depends 
on the criterion for optimality. For example, one can choose the weight vector 
w that minimizes the variance of score gaps between decoys and natives: 



a.Tgyj min ('^ ' ~ '^^^^^ 



■^^{w ■ {cn - Cd)) 



as used in reference (jTobi et all |200o3), or minimizing the Z-score of a large 
set of native proteins, or minimizing the Z-score of the native protein an d a. n 
ensemble of decoys (JChiu and Goldstein, 1998; Mirnv and Shakhnovich. 199fit) . 
or maximizing the ratio R between the width of the distribution of the score 
and the average score difference between the native state and the unfolded ones 
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ijGoldstein et aLlfl992HHao and Scheragalll999aj) . A series of important works 
using perceptron learning and other optimization te chniques iFriedrichs and Wo lvne; 
1989HGoldstein et a;J . ll992HTobi et g/l - bOOOat R^druscolo and Domanvi, 1998; 
Pima et all l2000() showed that effective hnear sum scoring functions can be ob- 
tained. 

There i s anot her optimaUty criterion according to the second geometric view 
||Hu et a/.l . l2004() . We can choose the hyperplane (to, b) that separates the set of 
points {cjv — cjj} with the largest distance to the origin. Intuitively, we want to 
characterize proteins with a region defined by the training set points {cjv — cjj}. 
It is desirable to define this region such that a new unseen point drawn from the 
same protein distribution as {cn — co} will have a high probability to fall within 
the defined region. Non-protein points following a different distribution, which 
is assumed to be centered around the origin when no a priori information is 
available, will have a high probability to fall outside the defined region. In this 
case, we are more interested in modeling the region or support of the distribution 
of protein data, rather than estimating its density distribution function. For 
linear scoring function, regions are half-spaces defined by hyperplanes, and the 
optimal hyperplane {w, b) is then the one with maximal distance to the origin. 
This is related to the novelty detection proble m and single-class support vector 
machine stu died in statistical learn ing theory ijVapnik and Chervonenki"^ ll964L 
^974; Schol kopf and SmolaL l2002|) . In our case, any non-protein points will 
need to be detected as outliers from the protein distribution characterized by 
{cjv — Co}- Among all linear functions derived from the same set of native 
proteins and decoys, an optimal weight vector w is likely to have the least 
amount of mis-labellings. The optimal weight vector w can be found by solving 
the following quadratic programming problem: 

Minimize lll'U^lP (4.44) 

subject to w ■ (cjv - cd) + 5 < for aU TV e TV and D eV. (4.45) 

The solution maximizes the distance 6/| | of the plane {w, b) to the origin. We 
obtained the solution by solving the following support vector machine problem: 

Minimize ^Hifp 

subject to w ■ Cm + d < —1 (4.46) 
W ■ CD + d> 1, 

where d > 0. Note that a solution of Problem 14.4611 satisfies the constraints in 
Inequalities H4.45|l . since subtracting the second inequality here from the first 
inequality in the constraint conditions of H4.46|l will give us w-{c[4 — Cj:,) + 2 < 0. 

4.4.3 Optimal nonlinear potential function 

Optimal linear potential function can be obtained using the optimization strat- 
egy discussed above. However, it is possible that the weight vector w does not 
exist, i.e., the final convex polyhedron V = HiLi T^i ni^Y be an empty set. This 
occurs if a large number of native protein structures are to be simultaneously 
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stabilized against a large number of decoy confor mations, no such potential 
functions i n the linear functional form can be found ijVendruscolo et all . l2000at 
iTobi et ad EoOOal. 

According to our geometric pictures, there are two possible scenarios. First, 
for a specific native protein i, there may be severe restriction from some inequal- 
ity constraints, which makes Vi an empty set. Some decoys are very difficult to 
discriminate due to perhaps deficiency in protein representation. In these cases, 
it is impossible to adjust the weight vector so the native protein has a lower score 
than the sequence decoy. Figure 4.4c shows a set of inequalities represented by 
straight lines according to the first geometric view. In this case, there is no 
weight vector that can satisfy all these inequality requirements. That is, no 
linear scoring function can classify all decoys from native protein. According to 
the second geometric view (Figure 4.4d), no hyperplane can separate all points 
(black and green) {cjv — cd} from the origin, which corresponds to the native 
structures. 

Second, even if a weight vector w can be found for each native protein, i.e., 
w is contained in a nonempty polyhedron, it is still possible that the intersection 
of n polyhedra is an empty set, i.e., no weight vector can be found that can 
discriminate all native proteins against the decoys simultaneously. Computa- 
tionally, the question whether a solu tion weight vector w exists can be answered 
unambiguously in polynomial time ijKarmarkarl Il98-^ . If a large number {e.g., 
hundreds) of native protein structures are to be simultaneously stabilized against 
a large number of decoy conformations (e .g., tens of millions ) , no such potential 
functions can be found computationally l)Vendruscolo et al\ . l2000at iTobi et aH 
l2nf)0a|) . Similar conclusion is drawn in a study for protein design, where it was 
found that no linear potential function can simul taneously discriminate a large 
number of native proteins from sequence decoys 

A fundamental reason for such failure is that the functional form of linear 
sum is too simplistic. It has been suggested that additional descriptors of protein 
structures such as higher order interactions {e.g., t hree-body or four-body con- 
tacts ) should be incorporated in protein description l|Betancourt and Thirumalal 
Il999t iMunson and SinghL Il997t IZheng et all Il997bj) . Functions with polvno- 
mial terms using up to 6 degree of Chebyshev ex pansion has also b een used to 
represent pairwise interactions in protein folding l)Fain et allkOO^ . 

We now discuss an alternative approach. Let us still limit ourselves to pair- 
wise contact interactions, although it can be nat urallv extended to include three 
or four body interactions I Li and LianeL l2005hl) . We can introduce a nonlinear 



scoring function analogous to the dual form of the linear function in Equation 
H4.43|l . which takes the following form: 

H{f{s, a)) = H{c) = ^dK{c, Cd) - ^ ajyif (c, c^), (4.47) 

where ao > and ajv > are parameters of the scoring function to be de- 
termined, and cu = f{s]\f,au) from the set of decoys V = {(sa^, au)} is the 
contact vector of a sequence decoy D mounted on a native protein structure Sj^, 
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and cjv — f{sN,apf) from the set of native training proteins J\f = {{sN,a,N)} 
is the contact vector of a native sequence mounted on its native structure 
stv- In this study, all decoy sequence {od} are taken from real proteins possess- 
ing different fold structures. The difference of this functional form from linear 
function in Equation (|4.43|l is that a kernel function K{x, y) replaces the linear 
term. A convenient kernel function K is: 

K{x,y) = e^ll^^yil^/^'^^ for any vectors x and y <E ■Af[j'D, 

where is a constant. Intuitively, the surface of the scoring function has smooth 
Gaussian hills of height ao centered on the location of decoy protein D, and 
has smooth Gaussian cones of depth aN centered on the location c^r of native 
structures N. Ideally, the value of the scoring function will be —1 for contact 
vectors cn of native proteins, and will be +1 for contact vectors co of decoys. 

4.4.4 Deriving optimal nonlinear scoring function. 

To obtain the nonlinear scoring function, our goal is to find a set of parameters 
{aD,a]\r} such that H(f(sN,aN)) has value close to —1 for native proteins, 
and the decoys have values close to +1. There are many different choices of 
{au, a^}. We use an optimality criterion originally developed in statistical 
learning theory l|Vapmkl . ll9 95; Burgcs, 1998; Scholkoof and Smola, 2002). First, 
we note that we have implicitly mapped each structure and decoy from M.'^^^ 
through the kernel function of K{x,y) = e~ll^~yll Z^"' to another space with 
dimension as high as tens of millions. Second, we then find the hyperplane of 
the largest margin distance separating proteins and decoys in the space trans- 
formed by the nonlinear kernel. That is, we search for a hyperplane with equal 
and maximal distance to the closest native proteins and the closest decoys in 
the transformed high dimensional space. Such a hyperplane can be found by 
obtaining the parameters {an} and {aN} from solving the following Lagrange 
dual form of quadratic programming problem: 

Maximize J^ieAfuv, " 5 EijG^u© yiyjaiaje-H^--^^ H'/^-^' 
subject to < tti < C, 



where C is a regu larizing constant that limits the influence of each misclassified 
prote i n or decoy j Vapnik and Chcr vonenkigl Il964 Il974t IVaonikl Il995t iBurgesl 
Il998t IScholkopf and Smola. 2002] . and j/i = — 1 if i is a native protein, and 
t/i = -1-1 if i is a decoy. These parameters lead to optimal discriminat i on of an 
unseen test set jV apnik and C hervonenkisl Il964 Il974t IVapnikL Il995t iBurgesl 
Il998l IScholkopf and Smola. 200^ . When projected back to the space of K^io, 
this hyperplane becomes a nonlinear surface. For the toy problem of Figure 4.4, 
Figure 4.4d shows that such a hyperplane becomes a nonlinear curve in 
formed by a mixture of Gaussian kernels. It separates perfectly all vectors 
{cjv — cjj} (black and green) from the origin. That is, a nonlinear scoring 
function can have perfect discrimination. 
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4.4.5 Optimization techniques. 

The techniques that have been used for optimizing potential function include 
perceptron learning, linear programming, gradient desc ent, statistical analy- 
sis, and support vect o r machine (Tobi et a/., 2000a; Vc ndruscolo a/.l l2000at 
IXia and Levitl l2000t iBastolla et all l200d l200lt IHu et a/.l.l2004j) . These are 
standard techniques that can be found in optimization and machine learning 
literature. For example, there are excellent linear programming sol vers based 
on si mplex method, as implemented in Clp, Glpk, and lp_SOLVE llBerkelaail 
2004*), and based on interior point method as impleme nted in the B PMD l(Meszaro3. 
1996), the HoPDM and the PCx packages (Czvzvk et fld 120041) . We neglect 
the details of these techniques and point readers to the excellent treatises of 
jPaoadimitriou and Steiglit j.ll998l:IVanderbeill996(l . 

4.5 Applications 

Knowledge-Based potential function has been widely used in the study of protein 
structure prediction, protein folding, and protein-protein interaction. In this 
section, we discuss briefly some of these applications. Additional details of 
applications of knowledge-based potential can be found in other chapters of this 
book. 

4.5.1 Protein structure prediction 

Protein structure prediction is an extraordinarily complex task that involves 
two major components: sampling the conformational space and recognizing the 
near native structures from the ensemble of sampled conformations. 

In protein structure prediction, methods for conformational sampling will 
generate a huge number of candidate protein structures. These are often called 
decoys. Among these decoys, only a few are near native structures that are very 
similar to the native structure. An knowledge-based potential function must 
be used to discriminate the near native structures from all other decoys for a 
successful structure prediction. 

Several decoy sets have been developed which are used as objective bench- 
marks to test if an knowledge-based potential function can successfully identify 
the native and near native structures. For example. Park and Levitt (1996) 
constructed a 4-state-reduced decoy set. This decoy test set contains native and 
near-native conformations of seven sequences, along with about 650 misfolded 
structures for each sequence. The positions of Cq of these decoys were generated 
by exhaustively enumerating ten selectively chosen residues in each protein us- 
ing a 4-state off-lattice model. All other residues were assigned the phi/psi value 
based on the best fit of a 4-statc model to the native chain ijPark and Levitt| . 

A central depository of fo lding decoy conformations is the Decoys R'Us 
I Samudrala and Levittl . Eoool) . See Section lOl for the URL hnks to download 
several folding and docking decoy sets. A variety of knowledge-based potential 
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Table 4.2: Performance of geometric potential on folding and docking decoy discrim- 
in ation. 

t:, , ]. 4-statc-icduced lattice-ssfit fisa-casp3 fisa Imds 
Folding i_ 

docoy Native" 2' Native z Native z Native z Native z 

^"^^^ 7/7 4.46 8/8 7.70 3/3 5.23 3/4 5.42 7/10 1.45 

Rosctta-Bound- Rosetta-Unbound- Rosetta- 

Perturb Perturb Unbound-Global Vakser's Sternberg's 

Doeknig p4jjtj.y(, ^ Native z Native z Native z Native z 
decoy 

sets 50/54 12.75 53/54 12.88 53/54 8.55 4/5 4.45 16/16 4.45 

RDOCK 29/42"^ 

" Number of native structures ranking first, eg. 7/7 means seven out of seven native 
structures have the lowest energy among their corresponding decoy sets. 

2 = E — Enativt^l f^') E and a are the mean and standard deviation of the energy values 
of conformations, respectively. 

Native complex is not included in this docking decoy sets. 32 out of 42 decoy sets 
have at least one near native structures (cRMSD < 2.5 A) in the top 10 structures. 



functions have been de veloped and their performance in decoy discrimination 
have steadily improved l|Zhou and ZhoiA 120021 [Lu and SkolnickL 120011: iLi et all 
l2003l) . 

Figure 4.5 shows an example of decoy discrimination on the 4- state- reduced 
decoy set. This result is based on the residue-level packing and distance- 
dependent alpha potential discussed earlier. For all of the seven proteins in 
the 4-state-reduced set, the native structures have the lowest energy. In addi- 
tion, all of the decoys with the lowest energy are within 2.5 A RMSD to the 
native structure. 

Table 1121 hsts the performance of the geometric potential function in folding 
and docking decoy discriminations. Several studies examine the comparative 
performance of different knowledge-base d potential functions ( Park and Levitt! 
Il996l: IZhou and Zhoul 120021: iGilisl. l2004j) . Such evaluations often are based on 
measuring the success in ranking native structure from a large set of decoy 
conformations and in obtaining a large z-score) for the native protein structure. 
Because the development of potential function is a very active research field, 
the comparison of performances of different potential functions will be different 
as new models and techniques are developed and incorporated. 

Knowledge-based potential function can not only be applied at the end of 
the conformation sampling to recognize near native structures, it can also be 
used during conformation generation to guide the efficien t sampling of protein 
struc t ures. Details of this appl ication can be found in l(,Ternigan and Bahail 
Il996t iHao and Scheragal Il999btl . In addition, knowledge-based potential also 
plays important role in protein threading studies. Chapter 13 provides further 
detailed discussion. 
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4.5.2 Protein-protein docking prediction 



Knowledge-Based potential function can also be used to study protein-protein 
interactions. Here we give an example of predicting the binding surface of 
seven antibody or ant ibody related proteins {e.g., Fab fragment, T-cell receptor) 
ijLi and LianeL EoOSaj) . These protein-protein complexes are taken from the 21 
Capri (Critical Assessment of PRedicted Interactions) target proteins. Capri 
is a community-wide competition de signed to obje c tively assess the abilities in 
protein-protein docking prediction l|Mendez et all . l2005(l . In Capri, a bhnd 
docking prediction starts from two known crystallographic or NMR structures 
of unbound proteins and ends with a comparison to a solved structure of the pro- 
tein complex, to which the participants did not have access. Knowledge-Based 
potential functions, together with geometric complementarity scoring functions, 
can be used to recognize near native docking complexes and to guide the gen- 
eration of conformations for protein-protein docking. 

When docking two proteins together, we say a cargo protein is docked to 
a fixed seat protein. To determine the binding surfaces on the cargo protein, 
we can examine all possible surface patches on the unbound structure of cargo 
protein as candidate binding interfaces. The alpha knowledge-based potential 
function is then used to identify native or near native binding surfaces. To 
evaluate the performance of the potential function, we assume the knowledge of 
the binding interface on the seat protein. We further assume the knowledge of 
the degree of near neighbors for interface residues. 

We first partition the surface of the unbound cargo protein into candidate 
surface patches, each has the same size as the native binding surface of m 
residues. A candidate surface patch is generated by starting from a surface 
residue on the cargo protein, and following alpha edges on the boundary of the 
alpha shape by breadth-first search, until m residues are found (Figure 4.6a). 
We construct n candidate surface patches by starting in turn from each of the n 
surface residue on the cargo protein. Because each surface residue is the center 
of one of the n candidate surface patch, the set of candidate surface patches 
cover exhaustively the whole protein binding interface. 

Second, we assume that a candidate surface patch on the cargo protein has 
the same set of contacts as that of the native binding surface. The degree of 
near neighbors for each hypothetical contacting residue pair is also assumed 
to be the same. We replace the m residues of the native surface with the m 
residues from the candidate surface patch. There are Yi'^b^'m \ different ways to 
permute the m residues of the candidate surface patch, where is the number 
of residue type i on the candidate surface patch. A typical candidate surface 
patch has about 20 residues, therefore the number of possible permutation is 
very large. For each candidate surface patch, we take a sample of 5,000 random 
permutations. For a candidate surface patch SPi, we assume that the residues 
can be organized so that they can interact with the binding partner at the lowest 
energy. Therefore, the binding energy E(SPi) is estimated as: 

E{SP,) = minE{SP,)f^, fc = 1, • • • ,5,000. 
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Table 4.3: Recognition of native binding surface of CAPRI targets. 



Target 


Complex 


Antib 


ody^ 


Antif 


^en 


Rank°^j;„^ 


Overlap*^ 




Overlap 


T02 


Rotavirus VP6-Fab 


1/283" 


0.71 


1/639 


0.68 


T03 


Flu hcmagglutinin-Fab 


1/297 


0.56 


1/834 


0.71 


T04 


Q-amylase-camclid Ab VH 1 


56/89 


0.60 


102/261 


0.03 


T05 


a-amylasc-camclid Ab VH 2 


23/90 


0.57 


57/263 


0.25 


T06 


a-amylase-camclid Ab VH 3 


1/88 


0.70 


1/263 


0.62 


T07 


SpcA supcrantigcn TCR/3 


1/172 


0.57 


1/143 


0.61 


T13 


SAGl-antibody complex 


1/286 


0.64 


1/249 


0.69 



" "Antibody" : Different surface patches on the antibody molecule are evaluated by the 
scoring function, while the native binding surface on the antigen remains unchanged. 
"Antigen" : similarly defined as "Antibody" . 

Ranking of the native binding surface among all candidate surface patches. 
Fraction of residues from the best candidate surface patch that overlap with residues 
from the native binding surface patch. 

The first number is the rank of native binding surface and the second number is the 
number of total candidate surface patches. 



Here E{SPi)^. is calculated based on the residue-level packing and distance- 
dependent potential for the k-th permutation. The value of E{SPi) is used to 
rank the candidate surface patches. 

We assess the statistical potential by taking antibody/antigen protein in 
turn as the seat protein, and the antigen/antibody as cargo protein. The native 
interface on the seat protein is fixed. We test if our statistical potential can 
discriminate native surface patch on the cargo protein from the set of candidate 
surface patches. We also test if the best scored patch resembles the native patch. 
The results are listed in Table IT^ and the predicted antigen-binding interface 
of target T02 is shown in Figure 4.6(b) as an example. For five out of the seven 
protein complexes, we succeeded in discriminating the native patches on both 
the antibody and the antigen. Over 50% of the residues from the best scored 
patch overlaps with corresponding native patch. Our statistical potential does 
not work as well for the target T04 and T05, because the antibodies of these 
two complexes do not use their CDR domains to recognize the antigens as an 
antibody usually docs, and such examples are not present in the dataset of the 
34 antibody-antigen complexes, based on which the alpha potential function 
was obtained. 



4.5.3 Protein design 

Protein design aims to identify sequence s compatible with a given 



Totein fold 
The goal 



but incompatible to any alternative folds 
is to design a novel protein that may not exist in nature but has enhanced or 
novel biological function. Several novel pr oteins have been s uccessfully designed 
in recent years JPahivat and Mavclll997t IhTiI eTaH I2OO0I: iLooeer et all . 120031: 
iKuhlman et alV 2003|) . The problem of protein design is complex, because even 
a small protein of just 50 residues can have an astronomical number of sequences 
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(10^^) This clearly precludes exhaustive search of the sequence space with any 
computational or experimental method. Instead, protein design methods rely 
on potential functions for biasing the search towards the feasible regions that 
encode protein sequences. To select the correct sequences and to guide the 
search process, a design potential function is critically important. Such a scoring 
function should be able to characterize the global fitness landscape of many 
proteins simultaneously. 

Here, we briefly describe the applicati on of optimal no nlinear design po- 
tential function discussed in Section 14.4.31 iIHu et all l2004|) in protein design. 
We aim to solve a simplified protein sequence design problem. Our goal is to 
distinguish each native sequence for a major portion of representative protein 
structures from a large number of alternative decoy sequences, each a fragment 
from proteins of different fold. 

To train the nonlinear potential function, a list of 440 pr oteins was compiled 
from t he 1998 release (Whatif9 8) of the Whatif database ( Vendruscolo et oil 
Using gapless threading ijMaiorov and CrippenLll992t) . a set of 14.080.766 
sequence decoys was obtained. The entries in Whatif99 database that are not 
present in Whatif98 are used as a test set. After clean-up, the test set consists 
of 194 proteins and 3,096,019 sequence decoys. 

To test the design scoring functions for discriminating native proteins from 
sequence decoys, we take the sequence a from the conformation-sequence pair 
{s]sf,a) for a protein with the lowest score as the predicted sequence. If it is 
not the native sequence ajv, the discrimination failed and the design scoring 
function does not work for this protein. 

The nonlinear design scoring function is capable of discriminating all of the 
440 native sequences. In contrasts, no linear scoring function can succeed in 
this task. The nonlinear potential function also works well for the test set, 
where it succeeded in correctly identifying 93.3% (181 out of 194) of native 
sequences in the independent test set of 194 proteins. This compares favorably 
with results obtained using o ptimal linear folding scoring function taken as 
reported in l|Tobi et adl2000a|) . which succeeded in identifying 80.9% (157 out 
of 194) of this test set. It also has better performance than optimal linear 
scoring function base d on calculations using parameters reported in reference 
l)Bastolla et a/.Ll200l|) . which succeeded in identifying 73.7% (143 out of 194) of 
proteins in the test set. The Miyazawa-Jernigan statistical potential succeeded 
in identifying 113 native proteins out of 194) (success rate 58.2%). 

4.5.4 Protein stability and binding affinity 

Because the stability of protein in the native conformation is determined by the 
distribution of the full ensemble of conformations, namely, the partition func- 
tion Z[a) of the protein sequence a, care must be taken when using statistical 
potentials to compare the stabilities of different protein sequences adopting the 
same given conformation as in protein design ( Mi va zawa_and J ernigaiL 199&: 
ISinniL Il99(l) . This issue is discussed in some detail in Subsection 14. 7. l1 
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Nevertheless, it is expected that statistical potential should work well in es- 
timating protein stability changes upon mutations, as the change in partition 
functions of the protein sequence is small. In most such studies and studies us- 
ing physics-based empirical po tential (see Chapter 4 in this book and reference 
l|Bordne r and Abagvanl EoO^ I. good correlation coefficien t (0.6-0.8) between 
rcdictcd a.nd measured stability change can be achieved ( Gilis and RoomanL 



1^ 



1996, 1997: G uerois all.l2002t iBordner and Abagvanll2004HHoppe and SchomburaL 



2005^ .Zhou and Zhoul.l200i^ 

Several studies have shown that statistical potentials can also be used to pre- 
dict q uantitative binding free energy of protein-protein or protein-ligand interac- 
tions dOeWitte and ShakhnovichLll996tlMitchell et a/.l.ll999llMuegge and Martini 
ll999HLiu et adl2004llZhang et allhOO^ . In fact. Xu et al. showed that a sim- 
ple number count of hydrophilic bridges across the binding interface is strongly 
correlated with binding free energies of protein-protein interaction i Xu et oli . 



^23)- This study suggests that binding free energy may be predicted success 
fully by number counts of various types of interfacial contacts defined using some 
distance threshold. Such number count studies provide an excellent benchmark 
to quantify the improvement in predicting binding free energy when using statis- 
tical potentials for different protein-protein and protein-ligand complexes. Sim- 
ilar to prediction of protein stability change upon mutation, knowledge based 
potential function played an important role in a successful study of p r edict- 
ing binding free energy changes upon mutation ijKortemme and Bakeil 120021: 
iKortemme a/.Ll2004| . 

4.6 Online resource 

A list of online sources of decoy data for folding and docking is provided in 
Table 1131 



4.7 Discussion 

4.7.1 Knowledge-Based statistical potential functions 

The statistical potential functions are often derived based on several assump- 
tions: (a) protein energetics can be decomposed primarily into pairwise in- 
teractions; (b) interactions are independent from each other; (c) the partition 
function in native proteins Z and in reference states Z' are approximately equal; 
(d) the probability of occupancy of a state follows the Boltzmann distribution. 
These assumptions arc often unrealistic and raise questions about the validity 
of the statistical potential functions: Can statistical potential functions provide 
energy-like quantities such as the folding free energy of a protein, or th e bind- 
ing free energy of a protein-protein complex I Thomas and Dilll . Il996bt l? Can 



statistical potential functions correctly recognize the native structures from al- 
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Table 4.4: Database of folding and docking decoy sets. 





Type 


URL 


Decoy 'R' Us° 


folding 


|http : / /dd . Stanford . edu/ 


Loop 


folding 


http: //franc isco . compbio .ucsf . edu| 






/~ j acobson/ decoy . htm 


GASP 


folding 


http: //predictioncenter . org/| 


ZDOCK, RDOCK 


docking 


|http : //zlab . bu . edu/~ leely/RDOCKl 






_decoy/ 


Vakser decoy set 


docking 


http: //www.bioinf ormatics .ku. edu/| 






f iles/vaJsser/decoys/ 


Sternberg decoy set 


docking 


http : / /www . sbg . bio . ic. ac. uk/ docking/ | 






all_decoys . html 


Rosetta 


docking, folding 


|http : / / depts . Washington . edu/bakerpg/ | 


CAPRI 


docking 


http : // capr i . ebi . ac . uk/ 



": The database of Decoy 'R' Us contains multiple decoy sets, Single decoy 
sets and loop decoy sets, ^-state-reduced decoy set is included in the multiple 
decoy sets. 



ternative conformations? 

The assumptions of statistical knowledge-based potential functions. 

From Eauation l4.4l we can obtain the potential function H{c) by estimating the 
probability 7r(c). However, we need a number of assumptions for this approach 
to work. We need the independency assumption to have: 

7r(c) = j|7r(c,) = nn^''' 

i i Ci 

where Ci is the number of occurrence of i-th structural feature, e.g., number 
of a specific residue pair contact; tt^ is the probability of i-th structural fea- 
ture in the database. That is, we have to assume that the distribution of a 
specific structural feature is independent and not influenced by any other fea- 
tures, and is of no consequence for the distribution of other features as well. 
We also need to assume that c provides an adequate characterization of protein 
interactions, and the functional form of ti; ■ c provides the correct measurement 
of the energy of the interactions. We further need to assume that the energy 
for a protein-solvent system is decomposable, , i.e., the overall energy can be 
partitioned into many basic energy terms, such as pairwise interactions, des- 
olvation energies. Moreover, the partition functions Z' in a chosen reference 
state are approximately equal to the partition functions Z in native proteins. 
These above assumptions together lead to the Boltzmann assumption that the 
structural features contained in the protein database must be a population cor- 
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rectly sampled under the Boltzmann distribution. That is, for for any protein 
descriptor, we have: 

TTi OC exp( — Wi). 

To calculate tt^ in practice, we have to rely on another assumption that 
all protein structures are crystallized at the same temperature. Therefore, the 
distribution tt^ is reasonably similar for all proteins in the database, and hence 
the frequency counts of protein descriptors in different protein structures can 
be combined by simple summation with equal weight. 

Clearly, none of these assumptions are strictly true. However, the success 
of many applications of using the statistical knowledge-based potentials indi- 
cate that they do capture many important properties of proteins. The question 
for improving statistical potential function is, how seriously each of these as- 
sumptions is violated and to what extent it affects the validity of the potential 
function. A few assumptions specific to a particular potential function (such as 
the coordination and solvation assumptions for the Miyazawa-Jernigan's reac- 
tion model) have been described earlier. Here we discuss several assumptions 
in details below. 

Interactions are not independent. Using a HP (hydrophobic-Polar) model 
on two-dimensional lattice, Thomas and Dill (1996) tested the accuracy of 
Miyazawa-Jernigan contact potentials and Sippl's distance-dependent poten- 
tials. In HP model, a peptide chain contains only two types of monomer: H 
and P. The true energies are set as H^^^.h) — ~1j ^{h.p) — and H(p,p) — 0. 
Monomers are in contact if they are non-bonded nearest neighbors on the lattice. 
The conformational space was exhaustively searched for all sequences with the 
chain length from 11 to 18. A sequence is considered to have a native structure 
if it has a unique ground energy state. All native structures were collected to 
build a structure database, from which the statistical potentials are extracted by 
following the Miyazawa-Jernigan or the Sippl method. The extracted energies 
are denoted as e-i^H,H)TG{H,P)j ^^nd ei^ppy 

It was found that neither of these two methods can extract the correct en- 
ergies. All extracted energies by these two methods depend on chain length, 
while the true energies do not. Using Miyazawa-Jernigan's method, the (iJ, H) 
contact is correctly determined as dominant and attractive. However, the esti- 
mated values for ei^H.P) ^(p,p) ^''^ equal to zero, whereas the true ener- 
gies H(^H p^ and H(^p p^ are equal to zero. Using Sippl's method, the extracted 
potentials erroneously show a distance-dependence, i.e., [H, H) interactions are 
favorable in short-distance but unfavorable in long-distance, and conversely for 
(P, P) interactions, whereas the true energies in the HP model only exist be- 
tween a first-neighbor [H, H) contact, and become zero for all the interactions 
separated by two or more lattice units. 

These systematic errors result from the assumption that the pairwise in- 
teractions are independent, and thu s the volume exclusion in proteins can be 
neglected l|Thomas and Dill Il996b() . However, {H,H) interactions indirectly 
affects the observed frequencies of {H, P) and (P, P) interactions. First, in both 
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contact and distance-dependent potentials, because only a limited number of 
inter-residue contacts can be made within the restricted volume at a given dis- 
tance, the high density of {H, H) pairs at short distances is necessarily coupled 
with the low density (relative to reference state) of [H, P) and (P, P) pairs at 
the same distances, especially at the distance of one lattice unit. As a result, the 
extracted (i?, P) and (P, P) energies are erroneously unfavorable at short dis- 
tance. Second, for distance-dependent potentials, the energy of a specific type 
of pair interaction at a given distance is influenced by the same type of pair 
at different distances. For example, the high density of (77, H) pairs at short 
distances causes a compensating depletion (relative to the uniform density ref- 
erence state) at certain longer distances, and conversely for [H, P) and (P, P) 
interactions. Admittedly this study was carried out using models of short chain 
lengths and a simple alphabet of residues where the foldable sequences may 
be very homologous, hence the observed artifacts are profound, the deficiencies 
of the statistical potentials revealed in this study such as the excluded volume 
eflfect is likely to be significant in potential functions derived from real proteins. 

Pairwise interactions are not additive. Interactions stabilizing proteins 
are often modeled by pairwise contacts at atom or residue level. An assumption 
associated with this approach is the additivity of pairwise interactions, namely, 
the total energy or fitness score of a protein is the linear sum of all of its pairwise 
interactions. 

However, the non- additivity effects have been clearly demonstra ted in clus- 
ter for mation of hydroph obic methane molecules^ both in experiment llBen-NaimL 



imial) a nd in simulation l|R,ank and Bakeil ll997':'Shimizu and ChanLl2nni[l200d: 
Czaolewski et al\ . l200fi). Protein structure refinement will likely require higher 
der interactions l|Betancourt and Thirumalal Il999l). Some three-body con- 



ori ^ 

tacts have been i ntroduced in several studies ("Eastwood and Wolvnesl I2OOII: 
iRossi et al\. l200lt iGodzik and Skolnickl [l992; .Godzik et aL .1992,) , where phys- 
ical models explicitly incorporating three-body interactions are developed. In 
addition, several studies of Delaunay four-body interactions clearly showed the 
importance of including higher order int eractions in explaining the observed 
frequency distribution of residue contacts llKrishnainoorthy and Tropsha ,12003 ; 



Garter Jr. erZ.'200l':'Gan eit a/.Ll200ll:IZheng et ffl/.LIl997aHSingh eit a/.Lfl996'a: 



Munson and Singh, 1997'). 

Li and Liang (2005) introduced a geometric model based on the Delau- 
nay triangulation and alpha shape to collect three-body interactions in native 
proteins. A nonadditivity coefficient V(ij^k) is introduced to compare the three- 
body potential energy ei^i^^k) with the summation of three pairwise interactions 
Cij, e(i_fc), and e(j_fc): 

V(i,j.k) = exp[-e(,j^fc)]/ exp[-(e(jj) -I- e^.^k) + (^(j,k))]- 

There are three possibilities: {1) v = 1: interaction of a triplet type is ad- 
ditive in nature and can be well approximated by the sum of three pairwise 
interactions; {2) v > 1: three-body interactions are cooperative and their as- 
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sociation is more favorable than three independent pairwise interactions; (3) 
u < 1: three-body interactions are anti-cooperative. 

After systematically quantifying the nonadditive effects of all 1,540 three- 
body contacts, it was found that hydrophobic interactions and hydrogen bond- 
ing interactions make nonadditive contributions to protein stability, but the 
nonadditive nature depends on whether such interactions are located in protein 
interior or on protein surface. When located in interior, many hydrophobic in- 
teractions such as those involving alkyl residues are anti-cooperative, namely 
V <1. Salt-bridge and regular hydrogen-bonding interactions such as those in- 
volving ionizable residues and polar residues are cooperative in interior. When 
located on protein surface, these salt-bridge and regular hydrogen bonding inter- 
actions are anti-cooperative with v < 1, and hydrophobi c interactions involving 
alkyl residues become cooperative l)Li and Lianell2n05bl) . 

Sequence dependency of the partition function Z{a). We can obtain the 
total effective energy AE{s, a) given a structure conformation s and its amino 
acid sequence a from Equation 1)4. 5|) : 



where Ci is the total number count of the occurrence of the i-th descriptor, e.g., 
the total number of z-th type of pairwise contact. The summation involving Z{a) 
and Z'{a) is ignored during the evaluation of AH{ci) by assuming Z{a) « Z'{a). 

It is clear that both Z{a) and Z'{a) do not depend on the particular struc- 
tural conformation s. Therefore, the omission of the term of the partition 
functions —kT\n{ z''(a) ) ^^^^ '^^^ affect the rank ordering of energy values of 
different conformations (i.e., decoys) for the same protein sequence. On the 
other hand, it is also clear that both Z{a) and Z'{a) depend on the specific 
sequence a of a protein. Therefore, there is no sound theoretical basis to com- 
pare the stabilities between different proteins using the same knowledge-based 
potential function, unless the ratio of Z{a] I Z' ( a) for each individual sequence 
is known and is included during the eval uation jMivazawa and JerniganI TlQSfit 
ISamudraJa a.nd Mould . IT99ilSinnll .ll 99ft. Notably, Dfire and other statistical 
energy functions have been successful used to predict binding affinities across 
different protein-protein/peptide complexes. Nevertheless, the theoretical basis 
is not sound either, because the values of partition function Z(a)s for different 
protein complexes can be drastically different. It remains to be seen whether 
a similarly successful prediction of binding affinities can be achieved just by 
using the number of native interface contacts at some specific distance interval, 
i.e. the packing density along the native interface. This omission is probably 
benign for the problem of predicting free energy change of a protein monomer 
or binding free energy change of a protein-protein complex upon point muta- 



AH{f{s,a)) = AH{c) = ^ Ai7(c,) 




(4.48) 
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tions, because the distribution of the ensemble of protein conformations may 
not change significantly after one or several point mutations. 

Evaluating potential function. The measure used for performance evalua- 
tion of potential functions is important. For example, z-score of native protein 
among decoys is widely-used as an important performance statistic. However, 
2-score strongly depends on the properties of the decoy set. Imagine we have ac- 
cess to the true energy function. If a decoy set has a diverse distribution in true 
energy values, the z-score of the native structure will not be very large. How- 
ever, this by no means suggests that a knowledge-based energy function that 
gives a larger z-score for native protein is better than the true energy function. 
Alternative measures may provide more accurate or useful performance evalua- 
tion. For example, the correlation r of energy value and CRMSD may be helpful 
in protein structure prediction. Since a researcher has no access to the native 
structure, (s)he has to rely on the guidance of an energy function to search for 
better structures with lower CRMSD to the unknown native structure. For this 
purpose, a potential function with a large r will be very useful. Perhaps the 
performance of a potential function should be judged not by a single statistic 
but comprehensively by a number of measures. 

4.7.2 Relationship of knowledge-based energy functions 
and further development 

The Miyazawa-Jernigan contact potential is the first widely used knowledge- 
based potential function. Because it is limited by the simple spatial description 
of a cut-off distance, it cannot capture the finer spatial details. Several distance- 
dependent potentials have been developed to overcome this limit ation, and in 
general have better performa nce |)Lu and SkolnickLl200lt ls'amudr ala and MouH 
ll998HZhou and ZhoilEool . A major focus of works in this area is the devel- 
opment of models for the reference state. For example, the use of the ideal 
gas as reference state in the potential function Dfire sign ificantly improves th e 
performance in folding and docking decoy discrimination l|Zhang adl2004a|) . 

Because protein surface, interior, and protein-protein interface are packed 
differently, the propensity of the same pairwise interaction can be different de- 
pending on whether the residues are solvent-exposed or are buried. The contact 
potential of Simons et al. considers two types of environm ent, i.e., buried and 
non-buried envi ronments separately JSimons et oil Il999bl) . The geometric po- 
tential function ijLi and LianelEoOSaj) described in Subscction l4. 3 . Sl incorporates 
both dependencies on distance and fine-graded local packing, resulting in sig- 
nificant improvement in performance. Table IT!^ show that this potential can be 
successfully used in both protein structure and docking prediction. Knowledge 
based potential has also been developed to account for the loss of backbone , 
side-chain, and translational entropies in folding and binding ijAmzelL 120001: 
iLee et adll99^ . 

Another emphasis of recent development of potential function is the orienta- 
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tional dependency of pairwise interact ion llKortemme et all l2003t iBuchete et all 
120031 l2004i ImI^ azawa and JernieanL |2005|) . Kortemme et al. developed an 
orientation-dependent hydrogen bonding potential, which im proved prediction 
of pr otein structure and specific protein-protein interactions l|Kortemme et all 
iool. Miyazawa and Jernigan developed a fully anisotropic distance-dependent 
potential, with drastic improvements in decoy discrimination over th e original 
Miyazawa- Jernigan contact potential ijMivazawa and JernieanL l2005(l . 



Computational Efficiency Given current computing power, all potential func- 
tions discussed above can be applied to large-scale discrimination of native or 
near-native structures from decoys. For example, the geometric potential re- 
quires complex computation of the Delaunay tetrahedrization and alpha shape 
of the molecule (see Chapter 7 for details). Nevertheless, the time complexity is 
only ©(A^logiV), where N is the number of residues for residual-level potentials 
or atoms for atom-level potentials. For comparison, a naive implementation of 
contact computing without the use of proper data structure such as a quad-tree 
or k-d tree is ©(iV^). 

In general, atom-level potentials have better accuracy in recognizing native 
structures than residue-level potentials, and is often preferred for the final re- 
finement of predicted structures, but it is computationally too expensive to be 
applicable in every step of a folding or sampling computation. 



Potential function for membrane protein. The potential functions we have 
discussed in Section 3 are based on the structures of soluble proteins. Mem- 
brane proteins are located in a very different physico-chemical environment. 
They also have different amino acid composition, and they fold differently. Po- 
tential functions developed for soluble proteins are therefore not applicable to 
membrane proteins. For example, Cys-Cys has the strongest pairing propensity 
because of the formation of disulfide bond. However, Cys-Cys pairs rarely occur 
in membrane proteins. This and other difference in pairwi se contact propensity 
betw een membrane and soluble proteins are discussed in ijAdamian and LianeL 

Nevertheless, the physical models underlying most potential functions devel- 
oped for soluble proteins can be modified for membrane proteins llAdam ian an d LianeL 
l200lLl2002LlAdamian et aIll2003l:IPa^k et g^llioolljackups Jr and LianeLl2005l). 
For e xample, Sale et al used the Mhip potential developed in ! AdarniaiTand LianeL 
l200l() to predict optimal bundling of TM helices. With the help of 27 additional 
sparse distance constraints from experiments reported in literature, these au- 
thors succeeded in predicting th e structure of dark -adapted rhodopsin to within 
3.2 A of the crystal structure jSale et all |20^. It is likely that statistical 
potentials can be similarly developed for protein-ligand and protein-nucleotides 
interactions using the same principle. 
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4.7.3 Optimized potential function 



Knowledge based potential function derived by optimization has a number of 
characteristics that are distinct from statistical potential. We discuss in detail 
below. 

Training set for optimized potential function. Unlike statistical po- 
tential functions where each native protein in the database contribute to the 
knowledge-based scoring function, only a subset of native proteins contribute. 
In an optimized potential function, in addition, a sma ll fraction of decoys also 
contribute to the scoring function. In the study of ijHu et all |^Q4), about 
50% of native proteins and < 0.1% of decoys from the original training data of 
440 native proteins and 14 million sequence decoys contribute to the potential 
function. 

As illustrated in the second geometric views, the discrimination of native 
proteins occurs at the boundary surface between the vector points and the 
origin. It does not help if the majority of the training data are vector points 
away from the boundary surface. This implies the need for optimized potential 
to have appropriate training data. If no a priori information is known, it is likely 
many decoys (>millions) will be needed to accurately define the discrimination 
boundary surface, because of the usually large dimension of the descriptors for 
proteins. However, this imposes significant computational burden. 

Various strategies have been developed to select only the most relevant vector 
points. Usually, one may only include the most difficult decoys during training, 
such as decoys with lower energy than native structures, decoys with lowest 
absolute energies, and decoys already con t ributing to the potential function i n 
previous iteration l|Micheletti p.t all boOlt Fl hbi p.t all l2nnnbl: 'Hu et al'., '20M. 
In addition, an i terative training proces s is often necessary (|Micheletti et all 
l200lUTobi e7flIl . l2000hl:lHu et a/lliool . 

Reduced nonlinear potential function. The use of nonlinear terms for po- 
tential function involves large datasets, because they are necessary a priori to 
define accurately the discrimination surface. This demands the solution of a 
huge optimization problem. Moreover, the representation of the boundary sur- 
face using a large basis set requires expensive computing time for the evaluation 
of a new unseen contact vector c. To overcome these difficulties, non-linear 
potential function needs to be further simplified. 

One simple approach is to use alternative optimal criterion, for example, by 
minimizing the distance expressed in 1-norm instead of the standard 2-norm 
Euclidean distance. The resulting potential function will automatically have 
reduced terms. Another promising approach is to use rectangle kernels (Hu, 
Dai, and Liang, manuscript). 

Potential function by optimal regression. Currently, most optimized po- 
tential functions are derived based on decoy discrimination, which is a form 
of binary classification. Here we suggest a conceptual improvement that can 
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significantly improve the development of optimized potential functions. If we 
can measure the thermodynamic stabilities of all major representative proteins 
under identical experimental conditions {e.g., temperature, pH, salt concentra- 
tion, and osmolarity), we can attempt to develop potential functions with the 
objective of minimizing the regression errors of fitted energy values and mea- 
sured energy values. The resulting energy surface will then provide quantitative 
information about protein stabilities. However, the success of this strategy will 
depend on coordinated experimental efforts in protein thermodynamic mea- 
surements. The scale of such efforts may need to be similar to that of genome 
sequencing projects and structural genomics projects. 

4.7.4 Data dependency of knowledge-based potentials 

There are many directions to improve knowledge-based potential functions. Of- 
ten it is desirable to include additional descriptors in the energy functions to 
more accurately account for solvation, hydrogen bonding, backbone conforma- 
tion {e.g., (j) Slid '4' angles), and side chain entropies. Furthermore, potential 
functions with different descriptors and details may be needed for diffe rent tasks 
{e.g. backbone prediction vs structure refinement, ijRohl et all\200^ ). 

An important issue in both statistical potential and optimized potential is 
their dependency on the amount of available training data and possible bias in 
such data. For example, whether a knowledge-based potential derived from a 
bias dat a set is appli c able to a different class of p roteins is the topic of several 
studies l|Zhang et a/.U2004cHKhatun et a/.U2004|) . Z hang et al further stu dies 
the effect of database choice on statistical potential (jZhang et "^. l2004bl) . In 
addition, when the amount of data is limited, over-fitting is a real problem 
if too many descriptors are introduced in either of the two types of potential 
functions. For statistical potential, hierarchical hypothesis testing should help 
to decide whether additional terms is warranted. Fo r optimized pote ntial, cross- 
validation will help to uncover possible overfitting ijHu et a^J . l2004j) . 

4.8 Summary 

In this chapter, we discussed the general framework of developing knowledge- 
based potential functions in terms of molecular descriptors, functional form, 
and parameter calculations. We also discussed the underlying thermodynamic 
hypothesis of protein folding. With the assumption that frequently observed 
protein features in a database of structures correspond to low energy state, 
frequency of observed interactions can be converted to energy terms. We then 
described in details the models behind the Miyazawa- Jernigan contact potential, 
distance dependent potentials, and geometric potentials. We also discussed 
how to weight sample structures of varying degree of sequence similarity in 
the structural database. In the section of optimization method, we describe 
general geometric models for the problem of obtaining optimized knowledge- 
based potential functions, as well as methods for developing optimal linear and 
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nonlinear potential functions. This is followed by a brief discussion of several 
applications of the knowledge-based potential functions. Finally, we point out 
general limitations and possible improvements for the statistical and optimized 
potential functions. 



4.9 Further reading 

Anfinsen ' s ther modynamic hypothesis can be found in ( Anfinsen et all Il96lt 



lAnfinse More technical details of the Miyazawa-Jernigan contact po- 

tential are described in |Mivazawa and Jcrnigan, 1985, 1996). Distance depen- 
dent potential function wa s first proposed by Sippl in llSippill9 90). with furthe r 
development described in ||Lu an d Skolnic Sam udrala and Moultl ll998l). 

The development of gjeometric potentials c an be found in (Zhen g e i? a0.ll997at 

Carter Jr. et al l l200lllLi e^"aIll2003tlKrishnamoorthv and Tropshall2003HMcConkey et all 



2003^. The gas-phas e approximation of the reference state is discussed in 
jZhou and ZhouLl2002|) . Thomas and Dill offered insightful comments about the 
deficie ncy of knowledge-based statistical potential functions ijThomas and DilH 
Il996bl). The deve lopment of optimized linear p otential functions can be found 
in (jVendruscolo e t al. 2000b; Michclct ti et ali hQQl: To^HToD, |2Q0Q3) ■ The 
geometric view for designing optimized pote ntial function a nd the nonlinear 
potential function are based on the results in <)Hu et a/J . l2004l . 
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Figure 4.4: Geometric views of the inequality requirement for protein scoring function. 
Here we use a two-dimensional toy example for illustration, (a). In the first geometric view, 
the space of w; = {wi ,11)2) is divided into two half-spaces by an inequality requirement, 
represented as a hyperplane w ■ (cjv — c_d) + 6 < 0. The hyperplane, which is a line in R^, 
is defined by the normal vector (cjv — cd), and its distance b/\\cN ^ cd\\ from the origin. 
In this figure, this distance is set to 1.0. The normal vector is represented by a short line 
segment whose direction points away from the straight line. A feasible weight vector w is 
located in the half-space opposite to the direction of the normal vector (cat — c_d). With 
the given set of inequalities represented by the lines, any weight vector w located in the 
shaped polygon can satisfy all inequality requirement and provides a linear scoring function 
that has perfect discrimination, (b). A second geometric view of the inequality requirement 
for linear protein scoring function. The space R'^ of a; = {xi,X2), where x = (cat — c_d), is 
divided into two half-spaces by the hyperplane w ■ {cm — cd) + b < 0. Here the hyperplane 
is defined by the normal vector w and its distance 6/||ty|| from the origin. The origin 
corresponds to the native protein. All points {cm — cd} are located on one side of the 
hyperplane away from the origin, therefore satisfying the inequality requirement. That is, 
a linear scoring function w such as the one represented by the straight line in this figure 
can have perfect discrimination, (c). In the second toy problem, a set of inequalities are 
represented by a set of straight lines according to the first geometric view. A subset of the 
inequalities require that the weight vector w to be located in the shaded convex polygon 
on the left, but another subset of inequalities require that w to be located in the dashed 
convex polygon on the top. Since these two polygons do not intersect, there is no weight 
vector w that can satisfy all inequality requirements. That is, no linear scoring function 
can classify these decoys from native protein, (d). According to the second geometric 
view, no hyperplane can separate all points {cm — cd} from the origin. But a nonlinear 
curve formed by a mixture of Gaussian kernels can have perfect separation of all vectors 
{cat — cd} from the origin: It has perfect discrimination. 
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Figure 4.5: Energies evaluated by packing and distance dependent residue contact po- 
tential plotted against the RMSD to native structures for conformations in Park(Sd Levitt 
Decoy Set. 
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Native antibody interface Best scored patch 



(a) (b) 

Figure 4.6: Recognition of binding surface patch of Capri targets, (a) Boundary of 
alpha shape for a caryo protein. Each node represents a surface residue, and each edge 
represents the alpha edge between two surface residues. A candidate surface patch is 
generated by starting from a surface residue on the cargo protein, and following alpha 
edges on the boundary of the alpha shape by breadth-first search, until m residues are 
included, (b) Native interface and the surface patch with the best score on the antibody 
of the protein complex CAPRI Target T02. Only heavy chain (in red) and light chain (in 
blue) of the antibody are drawn. The antigen is omitted from this illustration for clarity. 
The best scored surface patch (in green) resembles the native interface (in yellow): 71% 
residues from this surface patch are indeed on the native binding interface. The residue in 
white is the starting residue used to generate this surface patch with the best score. 
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